On 4/4/2025 12:04 PM, Jonathan Cameron wrote: > On Wed, 26 Mar 2025 20:47:17 -0500 > Terry Bowman <terry.bowman@xxxxxxx> wrote: > >> During CXL device cleanup the CXL PCIe Port device interrupts may remain >> enabled. This can potentialy allow unnecessary interrupt processing on >> behalf of the CXL errors while the device is destroyed. >> >> Disable CXL protocol errors by setting the CXL devices' AER mask register. >> >> Introduce pci_aer_mask_internal_errors() similar to pci_aer_unmask_internal_errors(). >> >> Next, introduce cxl_disable_prot_errors() to call pci_aer_mask_internal_errors(). >> Register cxl_disable_prot_errors() to run at CXL device cleanup. >> Register for CXL Root Ports, CXL Downstream Ports, CXL Upstream Ports, and >> CXL Endpoints. >> >> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx> > A few small comments in here. I haven't looked through all the rest of the series > as out of time today but this one caught my eye. >> >> @@ -223,7 +238,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port) >> struct device *cxlmd_dev __free(put_device) = &cxlmd->dev; >> struct cxl_dev_state *cxlds = cxlmd->cxlds; >> >> - if (!dport || !dev_is_pci(dport->dport_dev)) { >> + if (!dport || !dev_is_pci(dport->dport_dev) || !dev_is_pci(cxlds->dev)) { >> dev_err(&port->dev, "CXL port topology not found\n"); >> return; >> } >> @@ -232,6 +247,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port) >> >> cxl_assign_error_handlers(cxlmd_dev, &cxl_ep_error_handlers); >> cxl_enable_prot_errors(cxlds->dev); >> + devm_add_action_or_reset(cxlds->dev, cxl_disable_prot_errors, cxlds->dev); > This can fail (at least in theory). Should at least scream that oddly we've > disabled error handling interrupts if it is hard to return anything cleanly. > > Same for all the other cases. Ok. I will add a dev_err() for errors returned by devm_add_action_or_reset(). >> } >> >> #else >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c >> index d3068f5cc767..d1ef0c676ff8 100644 >> --- a/drivers/pci/pcie/aer.c >> +++ b/drivers/pci/pcie/aer.c >> @@ -977,6 +977,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev) >> } >> EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL"); >> >> +/** >> + * pci_aer_mask_internal_errors - mask internal errors >> + * @dev: pointer to the pcie_dev data structure >> + * >> + * Masks internal errors in the Uncorrectable and Correctable Error >> + * Mask registers. >> + * >> + * Note: AER must be enabled and supported by the device which must be >> + * checked in advance, e.g. with pcie_aer_is_native(). >> + */ >> +void pci_aer_mask_internal_errors(struct pci_dev *dev) >> +{ >> + int aer = dev->aer_cap; >> + u32 mask; >> + >> + pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask); >> + mask |= PCI_ERR_UNC_INTN; >> + pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask); >> + > It does an extra clear we don't need, but.... > pci_clear_and_set_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, > 0, PCI_ERR_UNC_INTN); > > is at very least shorter than the above 3 lines. Doing so will overwrite the existing mask. CXL normally only uses AER UIE/CIE but if the device happens to lose alternate training and no longer identifies as a CXL device than this mask value would be critical for reporting PCI AER errors and would need UCE/CE enabled (other than UIE/CIE). -Terry >> + pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask); >> + mask |= PCI_ERR_COR_INTERNAL; >> + pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask); >> +} >> +EXPORT_SYMBOL_NS_GPL(pci_aer_mask_internal_errors, "CXL"); >> + >> static bool is_cxl_mem_dev(struct pci_dev *dev) >> { >> /* >> diff --git a/include/linux/aer.h b/include/linux/aer.h >> index a65fe324fad2..f0c84db466e5 100644 >> --- a/include/linux/aer.h >> +++ b/include/linux/aer.h >> @@ -101,5 +101,6 @@ int cper_severity_to_aer(int cper_severity); >> void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn, >> int severity, struct aer_capability_regs *aer_regs); >> void pci_aer_unmask_internal_errors(struct pci_dev *dev); >> +void pci_aer_mask_internal_errors(struct pci_dev *dev); >> #endif //_AER_H_ >>