Bjorn,

On Mon, Aug 11, 2025 at 06:04:45PM -0500, Bjorn Helgaas wrote:
> On Fri, Aug 08, 2025 at 10:23:45AM +0800, Hui Wang wrote:
> > Hi Bjorn,
> >
> > Any progress on this issue, do we have a fix for this now? The
> > ubuntu users are waiting for a fix :-).
>
> Not yet, but thanks for the reminder. Keep bugging me!

Other distributions' users are waiting for the fix too!

Thanks,

>
> PCIe r7.0, sec 2.3.1, makes it clear that devices are permitted to
> return RRS after FLR:
>
>   ◦ For Configuration Requests only, if Device Readiness Status is not
>     supported, following reset it is permitted for a Function to
>     terminate the request and indicate that it is temporarily unable
>     to process the Request, but will be able to process the Request in
>     the future - in this case, the Request Retry Status (RRS)
>     Completion Status must be used (see § Section 6.6). Valid reset
>     conditions after which a device/Function is permitted to return
>     RRS in response to a Configuration Request are:
>
>       ▪ FLRs
>
>       ...
>
> But I am a little bit concerned because sec 2.3.2, which talks about
> how a Root Complex handles that RRS and the RRS Software Visibility
> feature, says (note the "system reset" period):
>
>   Root Complex handling of a Completion with Request Retry Status for
>   a Configuration Request is implementation specific, except for the
>   period following SYSTEM RESET (see § Section 6.6). For Root
>   Complexes that support Configuration RRS Software Visibility, the
>   following rules apply:
>
>     ◦ If Configuration RRS Software Visibility is enabled:
>
>       ▪ For a Configuration Read Request that includes both bytes of
>         the Vendor ID field of a device Function's Configuration Space
>         Header, the Root Complex must complete the Request to the host
>         by returning a read-data value of 0001h for the Vendor ID
>         field and all 1's for any additional bytes included in the
>         request.
>
> So I'm worried that the Software Visibility feature might work after
> *system reset*, but not necessarily after an FLR. That might make
> sense because I don't think the RC can tell when we are doing an FLR
> to a device.
>
> It seems that after FLR, most RCs *do* make RRS visible via SV. But
> if we can't rely on that, I don't know how we're supposed to learn
> when a device becomes ready.
>
> Bjorn
>
> > On 7/3/25 08:05, Hui Wang wrote:
> > > On 7/2/25 17:43, Hui Wang wrote:
> > > > On 7/2/25 07:23, Bjorn Helgaas wrote:
> > > > > On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
> > > > > > Sorry for late response, I was OOO the past week.
> > > > > >
> > > > > > This is the log after applied your patch:
> > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61
> > > > > >
> > > > > > Looks like the "retry" makes the nvme work.
> > > > >
> > > > > Thank you! It seems like we get 0xffffffff (probably PCIe
> > > > > error) for a long time after we think the device should be
> > > > > able to respond with RRS.
> > > > >
> > > > > I always thought the spec required that after the delays, a
> > > > > device should respond with RRS if it's not ready, but now I
> > > > > guess I'm not 100% sure. Maybe it's allowed to just do
> > > > > nothing, which would lead to the Root Port timing out and
> > > > > logging an Unsupported Request error.
> > > > >
> > > > > Can I trouble you to try the patch below? I think we might
> > > > > have to start explicitly checking for that error. That
> > > > > probably would require some setup to enable the error, check
> > > > > for it, and clear it. I hacked in some of that here, but
> > > > > ultimately some of it should go elsewhere.
> > ...
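
For anyone following along, the sec 2.3.2 Vendor ID rule quoted above
suggests a readiness poll roughly like the sketch below. This is not the
kernel code or the patch under discussion: classify_id_read(),
wait_for_ready() and the fake_dev simulation are hypothetical names made
up for illustration. It assumes RRS Software Visibility is enabled and
that the RC applies it after FLR, which is exactly the part in doubt. A
read of 0xffffffff is treated as "no usable response yet" and retried,
similar in spirit to the "retry" behaviour mentioned in the log.

```c
#include <stdint.h>

/* Possible outcomes of a 32-bit read at config offset 0 (Vendor/Device ID). */
enum cfg_state {
	CFG_READY,		/* a real Vendor ID came back */
	CFG_RRS,		/* RC synthesized Vendor ID 0001h: not ready yet */
	CFG_NO_RESPONSE,	/* all 1's: error or no completion */
};

static enum cfg_state classify_id_read(uint32_t val)
{
	if (val == 0xffffffffu)
		return CFG_NO_RESPONSE;
	/* With RRS SV enabled, Vendor ID reads as 0001h, other bytes all 1's. */
	if ((val & 0xffffu) == 0x0001u)
		return CFG_RRS;
	return CFG_READY;
}

/* Hypothetical accessor: returns the dword at config offset 0. */
typedef uint32_t (*read_id_fn)(void *ctx);

/*
 * Poll with exponential backoff until the device looks ready.
 * Returns the total delay in ms, or -1 once timeout_ms has been reached.
 * A real implementation would sleep for "delay" ms in each iteration.
 */
static int wait_for_ready(read_id_fn read_id, void *ctx, int timeout_ms)
{
	int delay = 1, waited = 0;

	for (;;) {
		if (classify_id_read(read_id(ctx)) == CFG_READY)
			return waited;
		if (waited >= timeout_ms)
			return -1;
		waited += delay;	/* stand-in for msleep(delay) */
		delay *= 2;
	}
}

/* Simulated device (assumption, for illustration only). */
struct fake_dev {
	int reads;	/* reads issued so far */
	int err_reads;	/* first N reads return all 1's */
	int rrs_reads;	/* next M reads return the RRS pattern */
	uint32_t id;	/* Device ID << 16 | Vendor ID once ready */
};

static uint32_t fake_read(void *ctx)
{
	struct fake_dev *d = ctx;
	int n = d->reads++;

	if (n < d->err_reads)
		return 0xffffffffu;
	if (n < d->err_reads + d->rrs_reads)
		return 0xffff0001u;
	return d->id;
}
```

Note that the sketch cannot tell "device still in reset" apart from a
genuine error when it sees 0xffffffff; that ambiguity is the reason the
thread considers checking the Root Port error status explicitly.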