On 6/12/2025 6:36 AM, Jonathan Cameron wrote: > On Tue, 3 Jun 2025 12:22:27 -0500 > Terry Bowman <terry.bowman@xxxxxxx> wrote: > >> The AER driver is now designed to forward CXL protocol errors to the CXL >> driver. Update the CXL driver with functionality to dequeue the forwarded >> CXL error from the kfifo. Also, update the CXL driver to begin the protocol >> error handling processing using the work received from the FIFO. >> >> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the >> AER service driver. This will begin the CXL protocol error processing >> with a call to cxl_handle_prot_error(). >> >> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was >> previously in the AER driver. >> >> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info' >> and use in discovering the erring PCI device. Make scope based reference >> increments/decrements for the discovered PCI device and the associated >> CXL device. >> >> Implement cxl_handle_prot_error() to differentiate between Restricted CXL >> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. >> RCH errors will be processed with a call to walk the associated Root >> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex >> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec() >> so the CXL driver can walk the RCEC's downstream bus, searching for >> the RCiEP. >> >> VH correctable error (CE) processing will call the CXL CE handler. VH >> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a >> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status() >> and pci_clean_device_status() used to clean up AER status after handling. >> >> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx> > Hopefully I haven't duplicated existing feedback. A few more minor things > inline. > >> + >> +static struct pci_dev *sbdf_to_pci(struct cxl_prot_error_info *err_info) > That's an odd function name as the sbdf is buried. > Unless it's going to get a lot of use, I'd just put the lookup > inline and not have a 'hard to name' function. With other suggested > changes you only need (at where this is currently called) > struct pci_dev *pdev __free(pci_dev_put) = > pci_get_domain_bus_and_slot(err_info->segment, err_info->bus, > err_info->devfn); Ok. I'll remove the helper and add inline. >> +{ >> + unsigned int devfn = PCI_DEVFN(err_info->device, >> + err_info->function); > This makes me wonder if you should have just use devfn inside the err_info. > Is there a good reason to split them up before storing them? > > IIRC ARI makes a mess anyway of their meaning when separate. Agreed. I'll keep devfn together. >> + struct pci_dev *pdev __free(pci_dev_put) = >> + pci_get_domain_bus_and_slot(err_info->segment, >> + err_info->bus, >> + devfn); >> + return pdev; >> +} >> + >> +static void cxl_handle_prot_error(struct cxl_prot_error_info *err_info) >> +{ >> + struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(sbdf_to_pci(err_info)); >> + >> + if (!pdev) { >> + pr_err("Failed to find the CXL device\n"); >> + return; >> + } >> + >> + /* >> + * Internal errors of an RCEC indicate an AER error in an >> + * RCH's downstream port. Check and handle them in the CXL.mem >> + * device driver. >> + */ >> + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC) >> + return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info); >> + >> + if (err_info->severity == AER_CORRECTABLE) { >> + int aer = pdev->aer_cap; >> + struct cxl_dev_state *cxlds = pci_get_drvdata(pdev); >> + struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev); >> + >> + if (aer) >> + pci_clear_and_set_config_dword(pdev, >> + aer + PCI_ERR_COR_STATUS, >> + 0, PCI_ERR_COR_INTERNAL); >> + >> + cxl_cor_error_detected(pdev); >> + >> + pcie_clear_device_status(pdev); >> + } else { >> + cxl_do_recovery(pdev); >> + } >> +} >> + >> static void cxl_prot_err_work_fn(struct work_struct *work) >> { >> + struct cxl_prot_err_work_data wd; >> + >> + while (cxl_prot_err_kfifo_get(&wd)) { >> + struct cxl_prot_error_info *err_info = &wd.err_info; >> + >> + cxl_handle_prot_error(err_info); > I'm not seeing value in the local variable. > Ignore if this code changes later and that makes more sense! It can be removed and simplified. Thanks for reviewing. -Terry >> + } >> } >> >> #else