On 6/6/2025 10:57 AM, Dave Jiang wrote: > > On 6/3/25 10:22 AM, Terry Bowman wrote: >> The AER driver is now designed to forward CXL protocol errors to the CXL >> driver. Update the CXL driver with functionality to dequeue the forwarded >> CXL error from the kfifo. Also, update the CXL driver to begin the protocol >> error handling processing using the work received from the FIFO. >> >> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the >> AER service driver. This will begin the CXL protocol error processing >> with a call to cxl_handle_prot_error(). >> >> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was >> previously in the AER driver. >> >> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info' >> and use in discovering the erring PCI device. Make scope based reference >> increments/decrements for the discovered PCI device and the associated >> CXL device. >> >> Implement cxl_handle_prot_error() to differentiate between Restricted CXL >> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. >> RCH errors will be processed with a call to walk the associated Root >> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex >> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec() >> so the CXL driver can walk the RCEC's downstream bus, searching for >> the RCiEP. >> >> VH correctable error (CE) processing will call the CXL CE handler. VH >> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a >> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status() >> and pci_clean_device_status() used to clean up AER status after handling. >> >> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx> >> --- >> drivers/cxl/core/ras.c | 92 +++++++++++++++++++++++++++++++++++++++++ >> drivers/pci/pci.c | 1 + >> drivers/pci/pci.h | 8 ---- >> drivers/pci/pcie/aer.c | 1 + >> drivers/pci/pcie/rcec.c | 1 + >> include/linux/aer.h | 2 + >> include/linux/pci.h | 10 +++++ >> 7 files changed, 107 insertions(+), 8 deletions(-) >> >> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c >> index d35525e79e04..9ed5c682e128 100644 >> --- a/drivers/cxl/core/ras.c >> +++ b/drivers/cxl/core/ras.c >> @@ -110,8 +110,100 @@ static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn); >> >> #ifdef CONFIG_PCIEAER_CXL >> >> +static void cxl_do_recovery(struct pci_dev *pdev) >> +{ >> +} >> + >> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data) >> +{ >> + struct cxl_prot_error_info *err_info = data; >> + struct pci_dev *pdev_ref __free(pci_dev_put) = pci_dev_get(pdev); >> + struct cxl_dev_state *cxlds; >> + >> + /* >> + * The capability, status, and control fields in Device 0, >> + * Function 0 DVSEC control the CXL functionality of the >> + * entire device (CXL 3.0, 8.1.3). >> + */ >> + if (pdev->devfn != PCI_DEVFN(0, 0)) >> + return 0; >> + >> + /* >> + * CXL Memory Devices must have the 502h class code set (CXL >> + * 3.0, 8.1.12.1). >> + */ >> + if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL) > Should use FIELD_GET() to be consistent with the rest of CXL code base > >> + return 0; >> + >> + if (!is_cxl_memdev(&pdev->dev) || !pdev->dev.driver) > I think you need to hold the pdev->dev lock while checking if the driver exists. Hi Dave, Wouldn't a reference count increment prevent the driver from being unbound and thus make this access here to the driver safe (given the pci_dev_get() above)? And a lock would prevent concurrent access with a busy wait when the driver executes the next lock take? Terry [snip]