Re: [PATCH v5 6/8] PCI/AER: Introduce ratelimit for error logs

Karolina Stolarek <karolina.stolarek@xxxxxxxxxx> · Fri, 4 Apr 2025 11:32:30 +0200

On 21/03/2025 19:41, Jon Pan-Doh wrote:
On Fri, Mar 21, 2025 at 6:46 AM Karolina Stolarek
<karolina.stolarek@xxxxxxxxxx> wrote:

On 21/03/2025 02:58, Jon Pan-Doh wrote:
   void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
-                  const char *level)
+                  const char *level, bool ratelimited)

Ideally, we would like to be able to extract the "ratelimited" flag from
the aer_err_info struct, with no need for extra parameters in this function.

I considered this, but pci_print_aer() and dpc_process_aer() both call
aer_print_error directly without populating info->dev[]. I thought it
was easier/cleaner to pass it in vs. populate info->dev[] and then
read it.

We decided to not rate limit DPC and with my patch, eventually, 
hopefully, landing upstream, we will stop caring about pci_print_aer() 
altogether.

It looks like we ratelimit the Root Port logs based on the source device
that generated the message, and the actual errors in
aer_process_err_devices() use their own ratelimits. As you noted in one
of your emails, there might be the case where we report errors but
there's no information about the Root Port that issued the interrupt
we're handling.

Your understanding is correct. I think the edge case described is
acceptable behavior:

1. Multiple errors arrive where the first source device is ratelimited
2. Root port log and first source device log are not printed
3. Other error source device(s) logs are printed

I find it inconsistent. I won't block the series on the basis of this 
change but wanted to point out that such a thing can happen.

(...)

The way I understood the suggestion in 20250320202913.GA1097165@bhelgaas
is that we evaluate the ratelimit of the Root Port or Downstream Port,
save it in aer_err_info, and use it in aer_print_rp_info() and
aer_print_error(). I'm worried that one noisy device under a Root Port
could hit a ratelimit and hide everything else

This is not a 100% translation of Bjorn's suggestion as I share your
concern (1 spammy device ratelimits and hides other devices).

I understand.

A fair (and complicated) solution would be to check the ratelimits of
all devices in the Error Message to see if there is at least one that
can be reported. If so, use that ratelimit when printing both the Root
Port info and the error details from that device.

I mentioned this as well[1], albeit briefly. I'd opt for this if my
initial solution isn't satisfactory. You could make it more
complicated by substituting the source device, if it is ratelimited,
to a non-ratelimited one. However, it's not good as it changes the
default expectation that the root port log would have the source ID.

Oh yes, that sounds overly complicated. I have some doubts about missing 
Root Port logs in that specific case (even if they are confusing and may 
point to the ratelimited source), but let's go with the series as it is.

All the best,
Karolina