Re: [PATCHES v3][RFC][CFT] mount-related stuff

On Wed, Sep 03, 2025 at 07:47:18AM -0700, Linus Torvalds wrote:
> On Tue, 2 Sept 2025 at 21:54, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > If nobody objects, this goes into #for-next.
> 
> Looks all sane to me.
> 
> What was the issue with generic/475? I have missed that context..

At some point, testing of that branch caught a failure in generic/475.
Unfortunately, it wouldn't trigger on every run, so there was a
possibility that it had started earlier.

When I went digging, I found it with the trixie kernel (6.12.38 in
that kvm at the time) rebuilt with my local config; the config used
by Debian didn't trigger it.  Bisection by config converged on
PREEMPT_VOLUNTARY (no visible failures) vs. PREEMPT (failures in
a bit under 10% of runs).
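
FWIW, the flip itself is just the preemption model choice; something
like this (a rough sketch - scripts/config plus an olddefconfig pass
is one way to toggle it) is all that differs between the two configs:

    # switch the preemption model from voluntary to full preemption
    scripts/config --disable PREEMPT_VOLUNTARY --enable PREEMPT
    make olddefconfig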

There are several failure modes; the most common is something like
...
echo '1' 2>&1 > /sys/fs/xfs/dm-0/error/fail_at_unmount
echo '0' 2>&1 > /sys/fs/xfs/dm-0/error/metadata/EIO/max_retries
echo '0' 2>&1 > /sys/fs/xfs/dm-0/error/metadata/EIO/retry_timeout_seconds
fsstress: check_cwd stat64() returned -1 with errno: 5 (Input/output error)
fsstress: check_cwd failure
fsstress: check_cwd stat64() returned -1 with errno: 5 (Input/output error)
fsstress: check_cwd failure
fsstress: check_cwd stat64() returned -1 with errno: 5 (Input/output error)
fsstress: check_cwd failure
fsstress: check_cwd stat64() returned -1 with errno: 5 (Input/output error)
fsstress: check_cwd failure
fsstress killed (pid 10824)
fsstress killed (pid 10826)
fsstress killed (pid 10827)
fsstress killed (pid 10828)
fsstress killed (pid 10829)
umount: /home/scratch: target is busy.
unmount failed
umount: /home/scratch: target is busy.
umount: /dev/sdb2: not mounted.

at the end of the output (that's mainline v6.12); other variants include
e.g. a quietly hanging udevadm wait (killable).  It's bloody annoying to
bisect - a 100-iteration run takes about 2.5 hours, and usually a failure
happens in the first 40 minutes or so, or not at all...
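
The loop itself is nothing fancy; roughly this (a sketch, assuming an
xfstests checkout with TEST_DEV/SCRATCH_DEV and friends already set up
in local.config - the path below is just an example):

    # hammer on generic/475, stop at the first failure
    cd ~/xfstests-dev
    for i in $(seq 1 100); do
        ./check generic/475 || break
    done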

PREEMPT is definitely the main contributor to the failure odds...  I'm doing
a bisection between v6.12 and v6.10 at the moment, will post when I get
something more useful...
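
The bisection driver is the obvious thing (a sketch; run-475-loop.sh is
a stand-in for whatever builds the kernel, boots the test box and runs
the loop above, exiting non-zero on failure):

    # bad (v6.12) first, good (v6.10) second
    git bisect start v6.12 v6.10
    git bisect run ./run-475-loop.sh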



