On 6/6/25 4:15 PM, Bowman, Terry wrote: > > > On 6/6/2025 10:57 AM, Dave Jiang wrote: >> >> On 6/3/25 10:22 AM, Terry Bowman wrote: >>> The AER driver is now designed to forward CXL protocol errors to the CXL >>> driver. Update the CXL driver with functionality to dequeue the forwarded >>> CXL error from the kfifo. Also, update the CXL driver to begin the protocol >>> error handling processing using the work received from the FIFO. >>> >>> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the >>> AER service driver. This will begin the CXL protocol error processing >>> with a call to cxl_handle_prot_error(). >>> >>> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was >>> previously in the AER driver. >>> >>> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info' >>> and use in discovering the erring PCI device. Make scope based reference >>> increments/decrements for the discovered PCI device and the associated >>> CXL device. >>> >>> Implement cxl_handle_prot_error() to differentiate between Restricted CXL >>> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. >>> RCH errors will be processed with a call to walk the associated Root >>> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex >>> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec() >>> so the CXL driver can walk the RCEC's downstream bus, searching for >>> the RCiEP. >>> >>> VH correctable error (CE) processing will call the CXL CE handler. VH >>> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a >>> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status() >>> and pci_clean_device_status() used to clean up AER status after handling. >>> >>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx> >>> --- >>> drivers/cxl/core/ras.c | 92 +++++++++++++++++++++++++++++++++++++++++ >>> drivers/pci/pci.c | 1 + >>> drivers/pci/pci.h | 8 ---- >>> drivers/pci/pcie/aer.c | 1 + >>> drivers/pci/pcie/rcec.c | 1 + >>> include/linux/aer.h | 2 + >>> include/linux/pci.h | 10 +++++ >>> 7 files changed, 107 insertions(+), 8 deletions(-) >>> >>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c >>> index d35525e79e04..9ed5c682e128 100644 >>> --- a/drivers/cxl/core/ras.c >>> +++ b/drivers/cxl/core/ras.c >>> @@ -110,8 +110,100 @@ static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn); >>> >>> #ifdef CONFIG_PCIEAER_CXL >>> >>> +static void cxl_do_recovery(struct pci_dev *pdev) >>> +{ >>> +} >>> + >>> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data) >>> +{ >>> + struct cxl_prot_error_info *err_info = data; >>> + struct pci_dev *pdev_ref __free(pci_dev_put) = pci_dev_get(pdev); >>> + struct cxl_dev_state *cxlds; >>> + >>> + /* >>> + * The capability, status, and control fields in Device 0, >>> + * Function 0 DVSEC control the CXL functionality of the >>> + * entire device (CXL 3.0, 8.1.3). >>> + */ >>> + if (pdev->devfn != PCI_DEVFN(0, 0)) >>> + return 0; >>> + >>> + /* >>> + * CXL Memory Devices must have the 502h class code set (CXL >>> + * 3.0, 8.1.12.1). >>> + */ >>> + if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL) >> Should use FIELD_GET() to be consistent with the rest of CXL code base >> >>> + return 0; >>> + >>> + if (!is_cxl_memdev(&pdev->dev) || !pdev->dev.driver) >> I think you need to hold the pdev->dev lock while checking if the driver exists. > Hi Dave, > > Wouldn't a reference count increment prevent the driver from being unbound and thus > make this access here to the driver safe (given the pci_dev_get() above)? And a lock > would prevent concurrent access with a busy wait when the driver executes the next > lock take? Actually nothing prevents a driver from being unbound unless you are holding the device lock. Because device core needs the device lock in order to call driver removal [1]. So if you acquire the lock, either the driver is still bound and you are ok, or the driver is gone and there's nothing to do. The incremented refcount prevents ->release() of the device and the memory allocated for the device from being freed based on kref behavior [2]. [1]: https://elixir.bootlin.com/linux/v6.15.1/source/drivers/base/dd.c#L1292 [2]: https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/kref.h#L48 DJ > > Terry > > [snip] >