On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
> Currently, the devlink health reporter initiates the grace period
> immediately after recovering an error, which blocks further recovery
> attempts until the grace period concludes. Since additional errors
> are not generally expected during this short interval, any new error
> reported during the grace period is not only rejected but also causes
> the reporter to enter an error state that requires manual intervention.
>
> This approach poses a problem in scenarios where a single root cause
> triggers multiple related errors in quick succession - for example,
> a PCI issue affecting multiple hardware queues. Because these errors
> are closely related and occur rapidly, it is more effective to handle
> them together rather than handling only the first one reported and
> blocking any subsequent recovery attempts. Furthermore, setting the
> reporter to an error state in this context can be misleading, as these
> multiple errors are manifestations of a single underlying issue, making
> it unlike the general case where additional errors are not expected
> during the grace period.
>
> To resolve this, introduce a configurable grace period delay attribute
> to the devlink health reporter. This delay starts when the first error
> is recovered and lasts for a user-defined duration. Once this grace
> period delay expires, the actual grace period begins. After the grace
> period ends, a new reported error will start the same flow again.
>
> Timeline summary:
>
> ----|--------|------------------------------/----------------------/--
> error is  error is    grace period delay          grace period
> reported  recovered  (recoveries allowed)     (recoveries blocked)
>
> With grace period delay, create a time window during which recovery
> attempts are permitted, allowing all reported errors to be handled
> sequentially before the grace period starts. Once the grace period
> begins, it prevents any further error recoveries until it ends.

We are rate limiting recoveries. The "networking solution" to the problem
you're describing would be to introduce a burst size, some kind of poor
man's token bucket filter. Could you say more about what designs were
considered and why this one was chosen?
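
To make the burst-size idea concrete, here is a rough sketch of a token
bucket applied to recovery attempts. It is illustrative only:
`RecoveryBucket`, `burst` and `refill_interval_ms` are made-up names, not
existing devlink attributes, and refilling one token per interval is just
one assumption about how such a limiter could behave.

```python
class RecoveryBucket:
    """Hypothetical token bucket for recovery attempts (not devlink code)."""

    def __init__(self, burst, refill_interval_ms):
        self.burst = burst                      # max recoveries in a burst
        self.refill_interval_ms = refill_interval_ms
        self.tokens = burst                     # start with a full bucket
        self.last_refill_ms = 0

    def _refill(self, now_ms):
        # Add one token per full interval elapsed, capped at the burst size.
        new_tokens = (now_ms - self.last_refill_ms) // self.refill_interval_ms
        if new_tokens > 0:
            self.tokens = min(self.burst, self.tokens + new_tokens)
            self.last_refill_ms += new_tokens * self.refill_interval_ms

    def try_recover(self, now_ms):
        # Return True if a recovery attempt is allowed at time now_ms.
        self._refill(now_ms)
        if self.tokens == 0:
            return False
        self.tokens -= 1
        return True


# Three closely spaced errors are recovered, the fourth is rejected,
# and one more recovery becomes possible after a full interval.
bucket = RecoveryBucket(burst=3, refill_interval_ms=1000)
results = [bucket.try_recover(t) for t in (0, 10, 20, 30, 1000, 1001)]
```

With a burst of 3, a PCI-style flurry of up to three related errors would
be recovered back to back, while a sustained stream would still be limited
to roughly one recovery per interval.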