On Thu, 15 May 2025 16:52:15 -0500 "Bowman, Terry" <terry.bowman@xxxxxxx> wrote: > On 4/25/2025 8:18 AM, Jonathan Cameron wrote: > > On Thu, 24 Apr 2025 09:17:45 -0500 > > "Bowman, Terry" <terry.bowman@xxxxxxx> wrote: > > > >> On 4/23/2025 10:04 AM, Jonathan Cameron wrote: > >>> On Wed, 26 Mar 2025 20:47:05 -0500 > >>> Terry Bowman <terry.bowman@xxxxxxx> wrote: > >>> > >>>> The AER service driver includes a CXL-specific kfifo, intended to forward > >>>> CXL errors to the CXL driver. However, the forwarding functionality is > >>>> currently unimplemented. Update the AER driver to enable error forwarding > >>>> to the CXL driver. > >>>> > >>>> Modify the AER service driver's handle_error_source(), which is called from > >>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors. > >>>> > >>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it > >>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error > >>>> masks. > >>>> > >>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error() > >>>> as done in the current AER driver. > >>>> > >>>> If the error is a CXL-related error then forward it to the CXL driver for > >>>> handling using the kfifo mechanism. > >>>> > >>>> Introduce a new function forward_cxl_error(), which constructs a CXL > >>>> protocol error context using cxl_create_prot_err_info(). This context is > >>>> then passed to the CXL driver via kfifo using a 'struct work_struct'. > >>>> > >>>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx> > >>> Hi Terry, > >>> > >>> Finally got back to this. I'm not following how some of the reference > >>> counting in here is working. It might be fine but there is a lot > >>> taking then dropping device references - some of which are taken again later. > >>> > >>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec) > >>>> pci_info(rcec, "CXL: Internal errors unmasked"); > >>>> } > >>>> > >>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info) > >>>> +{ > >>>> + int severity = info->severity; > >>> So far this variable isn't really justified. Maybe it makes sense later in the > >>> series? > >> This is used below in call to cxl_create_prot_err_info(). > > Sure, but why not just do > > > > if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) { > > > > There isn't anything modifying info->severity in between so that local > > variable is just padding out the code to no real benefit. > > > > > >>>> + pci_err(pdev, "Failed to create CXL protocol error information"); > >>>> + return; > >>>> + } > >>>> + > >>>> + struct device *cxl_dev __free(put_device) = get_device(err_info->dev); > >>> Also this one. A reference was acquired and dropped in cxl_create_prot_err_info() > >>> followed by retaking it here. How do we know it is still about by this call > >>> and once we pull it off the kfifo later? > >> Yes, this is a problem I realized after sending the series. > >> > >> The device reference incr could be changed for all the devices to the non-cleanup > >> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info(). > >> I need to look at the other calls to to cxl_create_prot_err_info() as well. > >> > >> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info. > >> This would eliminate the need for further accesses to the CXL device after being dequeued from the > >> fifo. Thoughts? > > That sounds like a reasonable solution to me. > > > > Jonathan > Hi Jonathan, Hi Terry, Sorry for delay - travel etc... > > Is it sufficient to rely on correctly implemented reference counting implementation instead > of caching the RAS status I mentioned earlier? > > I have the next revision coded to 'get' the CXL erring device's reference count in the AER > driver before enqueuing in the kfifo and then added a reference count 'put' in the CXL driver > after dequeuing and handling/logging. This is an alternative to what I mentioned earlier reading > the RAS status and caching it. One more question: is it OK to implement the get and put (of > the same object) in different drivers? It's definitely unusual. If there is anything similar to point at I'd be happier than this 'innovation' showing up here first. > > If we need to read and cache the RAS status before the kfifo enqueue there will be some other > details to work through. This still smells like the cleaner solution to me, but depends on those details.. Jonathan > > -Terry > >