I believe this diff solves a number of EAGAIN, disconnect and livelock
issues in cases where the layout needs to be refreshed because the mirror
state has changed.

ff_lseg_match_mirrors() will always return true, which means we
aggressively merge lsegs in a variety of cases. The problematic
interaction happens in pnfs_generic_layout_insert_lseg():

	if (do_merge(lseg, lp)) {
		mark_lseg_invalid(lp, free_me);
		continue;
	}

My reading of this code is that if we decide the new lseg we are inserting
is mergeable with an existing lseg, we mutate the state of the lseg being
inserted and then mark the existing cached lseg invalid.

In the stress test results that I've reviewed, marking the lseg invalid
causes a large number of undesirable side effects. This is because there
can be a large number of parallel syscalls that currently hold a reference
to that lseg. Marking the lseg invalid generally causes each such syscall
to return EAGAIN when it wakes up. I also see code paths where we
RESET_TO_PNFS.

I also see lots of disconnects, which I believe are coming from
ff_layout_cancel_io(). One way I believe we can reach that path is if
parallel IO calls pnfs_update_layout() in the window after we have called
mark_lseg_invalid() on the existing lseg (having decided to merge) but
before we actually insert the new one.

I think this code path could be further improved by inventing another way
of marking merged layouts. I don't think they need to be invalidated;
perhaps a less destructive state such as "stale" could be introduced that
lets existing IO finish before the lseg is cleaned up.

Jonathan Curley (1):
  NFSv4/flexfiles: Fix layout merge mirror check.

 fs/nfs/flexfilelayout/flexfilelayout.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.34.1
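
To make the "stale" suggestion in the cover letter above a little more concrete, here is a minimal userspace sketch of the lifetime rule I have in mind. It is not kernel code and not part of this patch: struct seg, SEG_STALE and every helper below are hypothetical names invented for illustration, and the model assumes a plain integer refcount with no locking. The only point is that a merged-away segment stops being handed out to new IO while IO that already holds a reference is allowed to finish.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in fs/nfs. */
enum seg_state { SEG_VALID, SEG_STALE };

struct seg {
	enum seg_state state;
	int refcount;	/* in-flight IO references plus the list's own reference */
	bool freed;	/* stands in for the real free so callers can inspect it */
};

static void seg_init(struct seg *s)
{
	s->state = SEG_VALID;
	s->refcount = 1;	/* reference held by the layout list */
	s->freed = false;
}

/* New IO may only pick up a segment that is still VALID. */
static bool seg_get_for_new_io(struct seg *s)
{
	if (s->state != SEG_VALID)
		return false;
	s->refcount++;
	return true;
}

/* IO that already holds a reference may complete on a STALE segment. */
static bool seg_io_may_complete(const struct seg *s)
{
	return !s->freed;
}

/* Drop one reference; the segment is released when the last one goes. */
static void seg_put(struct seg *s)
{
	assert(s->refcount > 0);
	if (--s->refcount == 0)
		s->freed = true;
}

/* On merge: hide the old segment from lookups and drop the list reference. */
static void seg_mark_stale(struct seg *old)
{
	if (old->state == SEG_STALE)
		return;
	old->state = SEG_STALE;
	seg_put(old);
}
```

In this model, a segment that one IO has already taken via seg_get_for_new_io() survives seg_mark_stale(), rejects any further seg_get_for_new_io() calls, and is released only when that IO calls seg_put(). Whether the real lseg reference counting can be adapted this way is an open question.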