On Fri, 5 Sep 2025 22:51:37 +0800
Dust Li <dust.li@xxxxxxxxxxxxxxxxx> wrote:

> >>Did some research and some thinking. Are you concerned about a
> >>performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
> >>to my current understanding the RNR must not lead to a catastrophic
> >>failure, but the RDMA/IB stack is supposed to handle that.
> >
> >No, it's not just a performance regression.
> >If we get an RNR when going from 64 -> 16, the whole link group gets
> >torn down, and all SMC connections inside it break.
> >So from the user’s point of view, connections will just randomly drop
> >out of nowhere.
>
> I double-checked the code and noticed we set qp_attr.rnr_retry =
> SMC_QP_RNR_RETRY = 7, which means "infinite retries."
> So the QP will just keep retrying; we won't actually get an RNR.
> That said, yeah, just performance regression.
>
> So in this case, I would regard it as acceptable. We can go with this.

Yes, that is consistent with Mahanta's testing, in the sense that he did
not see any catastrophic failure. Regarding the performance regression,
I don't know how bad it is. Mahanta was kind enough to do most of the
testing.

So that leaves us with replacing the tabs with spaces, and maybe with
the names, right? If you have a proposal for a better name, let's talk
about it.

BTW, are you aware of any generic counters that would help with figuring
out how many RNRs have been sent/received?

Regards,
Halil