Re: [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 6/6/25 4:15 PM, Bowman, Terry wrote:
> 
> 
> On 6/6/2025 10:57 AM, Dave Jiang wrote:
>>
>> On 6/3/25 10:22 AM, Terry Bowman wrote:
>>> The AER driver is now designed to forward CXL protocol errors to the CXL
>>> driver. Update the CXL driver with functionality to dequeue the forwarded
>>> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
>>> error handling processing using the work received from the FIFO.
>>>
>>> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
>>> AER service driver. This will begin the CXL protocol error processing
>>> with a call to cxl_handle_prot_error().
>>>
>>> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was
>>> previously in the AER driver.
>>>
>>> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info'
>>> and use in discovering the erring PCI device. Make scope based reference
>>> increments/decrements for the discovered PCI device and the associated
>>> CXL device.
>>>
>>> Implement cxl_handle_prot_error() to differentiate between Restricted CXL
>>> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
>>> RCH errors will be processed with a call to walk the associated Root
>>> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
>>> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
>>> so the CXL driver can walk the RCEC's downstream bus, searching for
>>> the RCiEP.
>>>
>>> VH correctable error (CE) processing will call the CXL CE handler. VH
>>> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
>>> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
>>> and pci_clean_device_status() used to clean up AER status after handling.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>>> ---
>>>  drivers/cxl/core/ras.c  | 92 +++++++++++++++++++++++++++++++++++++++++
>>>  drivers/pci/pci.c       |  1 +
>>>  drivers/pci/pci.h       |  8 ----
>>>  drivers/pci/pcie/aer.c  |  1 +
>>>  drivers/pci/pcie/rcec.c |  1 +
>>>  include/linux/aer.h     |  2 +
>>>  include/linux/pci.h     | 10 +++++
>>>  7 files changed, 107 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>> index d35525e79e04..9ed5c682e128 100644
>>> --- a/drivers/cxl/core/ras.c
>>> +++ b/drivers/cxl/core/ras.c
>>> @@ -110,8 +110,100 @@ static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>>>  
>>>  #ifdef CONFIG_PCIEAER_CXL
>>>  
>>> +static void cxl_do_recovery(struct pci_dev *pdev)
>>> +{
>>> +}
>>> +
>>> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>>> +{
>>> +	struct cxl_prot_error_info *err_info = data;
>>> +	struct pci_dev *pdev_ref __free(pci_dev_put) = pci_dev_get(pdev);
>>> +	struct cxl_dev_state *cxlds;
>>> +
>>> +	/*
>>> +	 * The capability, status, and control fields in Device 0,
>>> +	 * Function 0 DVSEC control the CXL functionality of the
>>> +	 * entire device (CXL 3.0, 8.1.3).
>>> +	 */
>>> +	if (pdev->devfn != PCI_DEVFN(0, 0))
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * CXL Memory Devices must have the 502h class code set (CXL
>>> +	 * 3.0, 8.1.12.1).
>>> +	 */
>>> +	if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL)
>> Should use FIELD_GET() to be consistent with the rest of CXL code base
>>
>>> +		return 0;
>>> +
>>> +	if (!is_cxl_memdev(&pdev->dev) || !pdev->dev.driver)
>> I think you need to hold the pdev->dev lock while checking if the driver exists.
> Hi Dave,
> 
> Wouldn't a reference count increment prevent the driver from being unbound and thus
> make this access here to the driver safe (given the pci_dev_get() above)? And a lock
> would prevent concurrent access with a busy wait when the driver executes the next
> lock take?

Actually nothing prevents a driver from being unbound unless you are holding the device lock. Because device core needs the device lock in order to call driver removal [1]. So if you acquire the lock, either the driver is still bound and you are ok, or the driver is gone and there's nothing to do. The incremented refcount prevents ->release() of the device and the memory allocated for the device from being freed based on kref behavior [2].

[1]: https://elixir.bootlin.com/linux/v6.15.1/source/drivers/base/dd.c#L1292
[2]: https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/kref.h#L48

DJ

> 
> Terry
> 
> [snip]
> 





[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux