On Fri, Aug 08, 2025 at 10:23:45AM +0800, Hui Wang wrote:
> Hi Bjorn,
>
> Any progress on this issue, do we have a fix for this now? The
> ubuntu users are waiting for a fix :-).

Not yet, but thanks for the reminder. Keep bugging me!

PCIe r7.0, sec 2.3.1, makes it clear that devices are permitted to
return RRS after FLR:

  ◦ For Configuration Requests only, if Device Readiness Status is
    not supported, following reset it is permitted for a Function to
    terminate the request and indicate that it is temporarily unable
    to process the Request, but will be able to process the Request
    in the future - in this case, the Request Retry Status (RRS)
    Completion Status must be used (see § Section 6.6). Valid reset
    conditions after which a device/Function is permitted to return
    RRS in response to a Configuration Request are:

    ▪ FLRs
    ...

But I am a little bit concerned because sec 2.3.2, which talks about
how a Root Complex handles that RRS and the RRS Software Visibility
feature, says (note the "system reset" period):

  Root Complex handling of a Completion with Request Retry Status for
  a Configuration Request is implementation specific, except for the
  period following SYSTEM RESET (see § Section 6.6). For Root
  Complexes that support Configuration RRS Software Visibility, the
  following rules apply:

  ◦ If Configuration RRS Software Visibility is enabled:

    ▪ For a Configuration Read Request that includes both bytes of
      the Vendor ID field of a device Function's Configuration Space
      Header, the Root Complex must complete the Request to the host
      by returning a read-data value of 0001h for the Vendor ID field
      and all 1's for any additional bytes included in the request.

So I'm worried that the Software Visibility feature might work after
*system reset*, but not necessarily after an FLR. That might make
sense because I don't think the RC can tell when we are doing an FLR
to a device.

It seems that after FLR, most RCs *do* make RRS visible via SV.
But if we can't rely on that, I don't know how we're supposed to
learn when a device becomes ready.

Bjorn

> On 7/3/25 08:05, Hui Wang wrote:
> > On 7/2/25 17:43, Hui Wang wrote:
> > > On 7/2/25 07:23, Bjorn Helgaas wrote:
> > > > On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
> > > > > Sorry for late response, I was OOO the past week.
> > > > >
> > > > > This is the log after applied your patch:
> > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61
> > > > >
> > > > > Looks like the "retry" makes the nvme work.
> > > >
> > > > Thank you! It seems like we get 0xffffffff (probably PCIe
> > > > error) for a long time after we think the device should be
> > > > able to respond with RRS.
> > > >
> > > > I always thought the spec required that after the delays, a
> > > > device should respond with RRS if it's not ready, but now I
> > > > guess I'm not 100% sure. Maybe it's allowed to just do
> > > > nothing, which would lead to the Root Port timing out and
> > > > logging an Unsupported Request error.
> > > >
> > > > Can I trouble you to try the patch below? I think we might
> > > > have to start explicitly checking for that error. That
> > > > probably would require some setup to enable the error, check
> > > > for it, and clear it. I hacked in some of that here, but
> > > > ultimately some of it should go elsewhere.
> ...
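
For readers following the thread, the Vendor ID values discussed
above can be sketched as a small polling loop. This is a user-space
illustration only, not kernel code and not the patch under
discussion. It assumes that 0x0001 means the RC reported RRS via
Software Visibility, and that 0xffff (all 1's) means no usable
response, which the loop retries rather than treating as fatal. The
names classify_vendor_id(), wait_for_ready(), read_vid_fn, struct
fake_dev and fake_read_vid() are hypothetical, and the sleep is only
accounted for, never performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Possible interpretations of a 16-bit Vendor ID config read. */
enum vid_state {
	VID_READY,       /* some other value: treated as a real Vendor ID */
	VID_RRS,         /* 0x0001: RRS made visible by the Root Complex */
	VID_NO_RESPONSE, /* 0xffff: all 1's, e.g. an error or a timeout */
};

static enum vid_state classify_vendor_id(uint16_t vid)
{
	if (vid == 0x0001)
		return VID_RRS;
	if (vid == 0xffff)
		return VID_NO_RESPONSE;
	return VID_READY;
}

/* Hypothetical accessor: returns the Vendor ID read from a device. */
typedef uint16_t (*read_vid_fn)(void *ctx);

/*
 * Poll the Vendor ID with a doubling delay until a real ID appears or
 * the accumulated wait reaches timeout_ms. Both RRS and all-1's are
 * retried (an assumption of this sketch). Returns the last observed
 * state and stores the total simulated wait in *waited_ms.
 */
static enum vid_state wait_for_ready(read_vid_fn read_vid, void *ctx,
				     unsigned int delay_ms,
				     unsigned int timeout_ms,
				     unsigned int *waited_ms)
{
	unsigned int waited = 0;
	enum vid_state state;

	assert(read_vid != NULL && delay_ms > 0);

	for (;;) {
		state = classify_vendor_id(read_vid(ctx));
		if (state == VID_READY || waited >= timeout_ms)
			break;
		/* A real caller would sleep for delay_ms here. */
		waited += delay_ms;
		delay_ms *= 2;
	}

	if (waited_ms)
		*waited_ms = waited;
	return state;
}

/* Simulated device: all 1's first, then RRS, then a made-up ID. */
struct fake_dev {
	unsigned int no_response_reads;
	unsigned int rrs_reads;
	uint16_t vendor_id;
};

static uint16_t fake_read_vid(void *ctx)
{
	struct fake_dev *dev = ctx;

	if (dev->no_response_reads) {
		dev->no_response_reads--;
		return 0xffff;
	}
	if (dev->rrs_reads) {
		dev->rrs_reads--;
		return 0x0001;
	}
	return dev->vendor_id;
}
```

With a fake device that returns all 1's twice and RRS three times
before its ID, wait_for_ready() with a 1 ms initial delay reports
VID_READY after a simulated 31 ms. A device that never leaves the
all-1's state ends with VID_NO_RESPONSE once the timeout is reached,
which is the case where software cannot tell "not ready yet" from
"error" by the Vendor ID alone.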