Re: [PATCH v4] fs/namespace: defer RCU sync for MNT_DETACH umount

Ian Kent <raven@xxxxxxxxxx> · Fri, 18 Apr 2025 08:31:03 +0800

On 17/4/25 23:12, Christian Brauner wrote:
On Thu, Apr 17, 2025 at 01:31:40PM +0200, Christian Brauner wrote:
On Thu, Apr 17, 2025 at 06:17:01PM +0800, Ian Kent wrote:
On 17/4/25 17:01, Christian Brauner wrote:
On Wed, Apr 16, 2025 at 11:11:51PM +0100, Mark Brown wrote:
On Tue, Apr 08, 2025 at 04:58:34PM -0400, Eric Chanudet wrote:
Defer releasing the detached file-system when calling namespace_unlock()
during a lazy umount to return faster.

When requesting MNT_DETACH, the caller does not expect the file-system
to be shut down upon returning from the syscall. Calling
synchronize_rcu_expedited() has a significant cost on RT kernel that
defaults to rcupdate.rcu_normal_after_boot=1. Queue the detached struct
mount in a separate list and put it on a workqueue to run post RCU
grace-period.
For the past couple of days we've been seeing failures in a bunch of LTP
filesystem related tests on various arm64 systems.  The failures are
mostly (I think all) in the form:

20101 10:12:40.378045  tst_test.c:1833: TINFO: === Testing on vfat ===
20102 10:12:40.385091  tst_test.c:1170: TINFO: Formatting /dev/loop0 with vfat opts='' extra opts=''
20103 10:12:40.391032  mkfs.vfat: unable to open /dev/loop0: Device or resource busy
20104 10:12:40.395953  tst_test.c:1170: TBROK: mkfs.vfat failed with exit code 1

ie, a failure to stand up the test environment on the loopback device
all happening immediately after some other filesystem related test which
also used the loop device.  A bisect points to commit a6c7a78f1b6b97
which is this, which does look rather relevant.  LTP is obviously being
very much an edge case here.
Hah, here's something I didn't consider and that I should've caught.

Look, on current mainline no matter if MNT_DETACH/UMOUNT_SYNC or
non-MNT_DETACH/UMOUNT_SYNC. The mntput() calls after the
synchronize_rcu_expedited() calls will end up in task_work():

          if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
                  struct task_struct *task = current;
                  if (likely(!(task->flags & PF_KTHREAD))) {
                          init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
                          if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
                                  return;
                  }
                  if (llist_add(&mnt->mnt_llist, &delayed_mntput_list))
                          schedule_delayed_work(&delayed_mntput_work, 1);
                  return;
          }

because all of those mntput()s are done from the task's contect.

IOW, if userspace does umount(MNT_DETACH) and the task has returned to
userspace it is guaranteed that all calls to cleanup_mnt() are done.

With your change that simply isn't true anymore. The call to
queue_rcu_work() will offload those mntput() to be done from a kthread.
That in turn means all those mntputs end up on the delayed_mntput_work()
queue. So the mounts aren't cleaned up by the time the task returns to
userspace.

And that's likely problematic even for the explicit MNT_DETACH use-case
because it means EBUSY errors are a lot more likely to be seen by
concurrent mounters especially for loop devices.

And fwiw, this is exactly what I pointed out in a prior posting to this
patch series.
And I didn't understand what you said then but this problem is more

understandable to me now.
I mean I'm saying it could be problematic for the MNT_DETACH case. I'm
not sure how likely it is. If some process P1 does MNT_DETACH on a loop
device and then another process P2 wants to use that loop device and
sess EBUSY then we don't care. That can already happen. But even in this
case I'm not sure if there aren't subtle ways where this will bite us.

But there's two other problems:

(1) The real issue is with the same process P1 doing stupid stuff that
     just happened to work. For example if there's code out there that
     does a MNT_DETACH on a filesystem that uses a loop device
     immediately followed by the desire to reuse the loop device:

     It doesn't matter whether such code must in theory already be
     prepared to handle the case of seeing EBUSY after the MNT_DETACH. If
     this currently just happens to work because we guarantee that the
     last mntput() and cleanup_mnt() will have been done when the caller
     returns to userspace it's a uapi break plain and simple.

     This implicit guarantee isn't there anymore after your patch because
     the final mntput() from is done from the system_unbound_wq which has
     the consequence that the cleanup_mnt() is done from the
     delayed_mntput_work workqeueue. And that leads to problem number
     (2).

This is a bit puzzling to me.

All the mounts in the tree should be unhashed before any of these mntput()

calls so I didn't think it would be found. I'll need to look at the loop

device case to work out how it's finding (or holing onto) the stale mount

and concluding it's busy.

Although there was place I was concerned about:

                disconnect = disconnect_mount(p, how);
                if (mnt_has_parent(p)) {
                        mnt_add_count(p->mnt_parent, -1);
                        if (!disconnect) {
                                /* Don't forget about p */
list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
                        } else {
                                umount_mnt(p);
                        }
                }

here p ends up not being unhashed so I need to look again at this area and

try and understand the special cases.

(2) If a userspace task is dealing with e.g., a broken NFS server and
     does a umount(MNT_DETACH) and that NFS server blocks indefinitely
     then right now it will be the task's problem that called the umount.
     It will simply hang and pay the price.

     With your patch however, that cleanup_mnt() and the
     deactivate_super() call it entails will be done from
     delayed_mntput_work...

     So if there's some userspace process with a broken NFS server and it
     does umount(MNT_DETACH) it will end up hanging every other
     umount(MNT_DETACH) on the system because the dealyed_mntput_work
     workqueue (to my understanding) cannot make progress.

     So in essence this patch to me seems like handing a DOS vector for
     MNT_DETACH to userspace.

Umm ... this is a really serious problem and can fairly easily happen and

as much as NFS has improved over the years it still happens.

But again, the problem is not so much waiting for rcu-walks to be done as

waiting longer than is needed ...

But we've also worsened that situation by doing the deferred thing for
any non-UMOUNT_SYNC. That which includes namespace exit. IOW, if the
last task in a new mount namespace exits it will drop_collected_mounts()
without UMOUNT_SYNC because we know that they aren't reachable anymore,
after all the mount namespace is dead.

But now we defer all cleanup to the kthread which means when the task
returns to userspace there's still mounts to be cleaned up.
Correct me if I'm wrong but the actual problem is that the mechanism used

to wait until there are no processes doing an rcu-walk on mounts in the

discard list is unnecessarily long according to what Eric has seen. So a
I think that the current approach is still salvagable but I need to test
this and currently LTP doesn't really compile for me.
I managed to reproduce this and it is like I suspected. I just thought
"Oh well, if it's not UMOUNT_SYNC then we can just punt this to a
workqueue." which is obviously not going to work.

If a mount namespace gets cleaned up or if a detached mount tree is
cleaned up or if audit calls drop_collected_mounts() we obviously don't
want to defer the unmount. So the fix is to simply restrict this to
userspace actually requesting MNT_DETACH.

But I'm seeing way more substantial issues with this patch.

Yeah, it's certainly much more difficult than it looks at first sight.

Ian