Bowman, Terry wrote: > On 7/23/2025 9:01 PM, dan.j.williams@xxxxxxxxx wrote: > > Terry Bowman wrote: > >> CXL error handling will soon be moved from the AER driver into the CXL > >> driver. This requires a notification mechanism for the AER driver to share > >> the AER interrupt with the CXL driver. The notification will be used > >> as an indication for the CXL drivers to handle and log the CXL RAS errors. > >> > >> First, introduce cxl/core/native_ras.c to contain changes for the CXL > >> driver's RAS native handling. This as an alternative to dropping the > >> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs. > >> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to > >> conditionally compile the new file. > > I see no daylight between CXL_NATIVE_RAS and PCIEAER_CXL, one of those > > needs to subsume the other. I also do not understand the point of > > "NATIVE" in the name. Will not CPER notified protocol errors be routed > > to the same CXL error handling infrastructure as AER notified protocol > > errors? I.e. the aer_recover_queue() path? > > This change and comment is planned to be removed in v11. Instead of introducing this > as a new file. The same changes will instead be added to pci_aer.c/pci_ras.c Dave Jiang > is introducing here: > > https://lore.kernel.org/linux-cxl/20250721170415.285961-1-dave.jiang@xxxxxxxxx/ Lets just put it all in cxl/core/ras.c, I don't think we need to have fine grained file distinctions between the "native", "aer", and "ras component registers" cases. "Want CXL error handling? Need ras.c." [..] <trim three pages of context that had no new comments to respond, please trim your replies> > >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > >> index 0b4d721980ef..8417a49c71f2 100644 > >> --- a/drivers/pci/pcie/aer.c > >> +++ b/drivers/pci/pcie/aer.c > >> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info) > >> > >> static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info) > >> { > >> - cxl_rch_handle_error(dev, info); > > No, can not just drop what was working before, even if you restore the > > functionality in a later patch in the same series. > > > > I would expect that this patch at a minimum maintains RCH handling and > > forwards anything else to the CXL core for VH handling. > > You want RCH handling to stay here in the AER driver or does the patch changes need > to be reworked to present the change better? Stay here. The cost of exporting PCI core functionality for this one-off seems not worth it. > >> - pci_aer_handle_error(dev, info); > >> + if (is_cxl_error(dev, info)) > >> + forward_cxl_error(dev, info); > >> + else > >> + pci_aer_handle_error(dev, info); > >> + > >> pci_dev_put(dev); > >> } > >> > >> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c > >> index b2ea14f70055..846ab55d747c 100644 > >> --- a/drivers/pci/pcie/cxl_aer.c > >> +++ b/drivers/pci/pcie/cxl_aer.c > > With the RCH bits moved to its own file then this file would be 100% > > concerned with typical CXL VH error handling and deserve to carry the > > "cxl_aer.c" name. > The plan was to move all the handling to cxl/core/pci_aer.c or whatever it is renamed. It would need more justification to overcome the perception of "new exports for code that is already on a deprecation watch" [..] > >> diff --git a/include/linux/aer.h b/include/linux/aer.h > >> index 02940be66324..24c3d9e18ad5 100644 > >> --- a/include/linux/aer.h > >> +++ b/include/linux/aer.h > >> @@ -10,6 +10,7 @@ > >> > >> #include <linux/errno.h> > >> #include <linux/types.h> > >> +#include <linux/workqueue_types.h> > >> > >> #define AER_NONFATAL 0 > >> #define AER_FATAL 1 > >> @@ -53,6 +54,26 @@ struct aer_capability_regs { > >> u16 uncor_err_source; > >> }; > >> > >> +/** > >> + * struct cxl_proto_err_info - Error information used in CXL error handling > >> + * @severity: AER severity > >> + * @function: Device's PCI function > >> + * @device: Device's PCI device > >> + * @bus: Device's PCI bus > >> + * @segment: Device's PCI segment > >> + */ > >> +struct cxl_proto_error_info { > >> + int severity; > >> + > >> + u8 devfn; > >> + u8 bus; > >> + u16 segment; > >> +}; > >> + > >> +struct cxl_proto_err_work_data { > >> + struct cxl_proto_error_info err_info; > >> +}; > > Why not use cxl_proto_error_info directly? > At one point I thought there was a good reason for using it later in another case. > I'll use cxl_proto_error_info directly. Maybe that was more the CPER case where the CXL standard mailbox payload structure is wrapped with a CPER-record envelope? In any event, I think it is safe to drop that indirection.