Re: [PATCH v5 6/8] PCI/AER: Introduce ratelimit for error logs

On 21/03/2025 02:58, Jon Pan-Doh wrote:
>
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
-		     const char *level)
+		     const char *level, bool ratelimited)

Ideally, we would like to be able to extract the "ratelimited" flag from the aer_err_info struct, with no need for extra parameters in this function.

  static void aer_print_rp_info(struct pci_dev *rp, struct aer_err_info *info)
  {
  	u8 bus = info->id >> 8;
  	u8 devfn = info->id & 0xff;
+	struct pci_dev *dev;
+	bool ratelimited = false;
+	int i;
-	pci_info(rp, "%s%s error message received from %04x:%02x:%02x.%d\n",
-		 info->multi_error_valid ? "Multiple " : "",
-		 aer_error_severity_string[info->severity],
-		 pci_domain_nr(rp->bus), bus, PCI_SLOT(devfn),
-		 PCI_FUNC(devfn));
+	/* extract endpoint device ratelimit */
+	for (i = 0; i < info->error_dev_num; i++) {
+		dev = info->dev[i];
+		if (info->id == pci_dev_id(dev)) {
+			ratelimited = info->ratelimited[i];
+			break;
+		}
+	}

(please correct me if I'm misreading the patch)

It looks like we ratelimit the Root Port logs based on the source device that generated the message, while the actual errors in aer_process_err_devices() use their own ratelimits. As you noted in one of your emails, there might be a case where we report errors but print no information about the Root Port that issued the interrupt we're handling.

The way I understood the suggestion in 20250320202913.GA1097165@bhelgaas is that we evaluate the ratelimit of the Root Port or Downstream Port, save it in aer_err_info, and use it in aer_print_rp_info() and aer_print_error(). I'm worried that one noisy device under a Root Port could hit that ratelimit and hide the errors from every other device below the same port.

A fair (and complicated) solution would be to check the ratelimits of all devices in the Error Message to see if there is at least one that can be reported. If so, use that ratelimit when printing both the Root Port info and the error details from that device.
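
To make that concrete, here is a rough userspace sketch of the selection step. It is not kernel code and not a proposed patch: struct model_err_info, MODEL_MAX_ERR_DEVS and both helper names are made up for illustration, and the model assumes the per-device ratelimit decisions have already been stored in an array the way info->ratelimited[] is used in the hunk above.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ERR_DEVS 5	/* arbitrary bound for this model */

/* Hypothetical stand-in for the relevant aer_err_info fields. */
struct model_err_info {
	int error_dev_num;
	/* true: logging for this device is currently suppressed */
	bool ratelimited[MODEL_MAX_ERR_DEVS];
};

/*
 * Return the index of the first device whose ratelimit still allows
 * logging, or -1 if every device in the Error Message is suppressed.
 */
static int model_first_reportable(const struct model_err_info *info)
{
	int i;

	for (i = 0; i < info->error_dev_num; i++) {
		if (!info->ratelimited[i])
			return i;
	}
	return -1;
}

/* Suppress the Root Port line only when no device can be reported. */
static bool model_rp_info_ratelimited(const struct model_err_info *info)
{
	return model_first_reportable(info) < 0;
}
```

With something along these lines, a single noisy device would not suppress the Root Port line as long as any other device in the same Error Message is still allowed to log.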

This is to say that if we keep aer_print_rp_info() (which was decided a couple emails ago), we should print it before any error coming from that Root Port is reported.

All the best,
Karolina



