Scalable handling of the writers count on a mount was added back in
2009, when the counter got split into per-cpu components.  The tricky
part, of course, was on the "make it read-only" side, and the way it
had been done was to add a flag for "we are adding that shit up, feel
free to increment, but don't get through mnt_get_write_access() until
we are finished".  Details are in mnt_get_write_access() and
mnt_{,un}hold_writers().  Current rules:

	* mnt_hold_writers()/mnt_unhold_writers() should be in the same
	  mount_lock scope.
	* any successful mnt_hold_writers() must be followed by
	  mnt_unhold_writers() within the same scope.
	* mnt_get_write_access() *MAY* take and release mount_lock, but
	  only if there's somebody currently playing with
	  mnt_hold_writers() on the same mount.

The non-obvious trouble came in 2011 (in 4ed5e82fe77f "vfs: protect
remounting superblock read-only") when we got
sb_prepare_remount_readonly(), essentially doing mnt_hold_writers()
for each mount over the given superblock.  I hadn't realized the
implications until this year ;-/

The trouble is, as soon as a mount has been added to ->s_mounts
(currently in vfs_create_mount() and clone_mnt()) it becomes
accessible to other threads, even if only in a very limited way.
That breaks very natural assumptions, subtly enough to make the
resulting races hard to detect.  Note, for example, this in
ovl_get_upper():

	upper_mnt = clone_private_mount(upperpath);
	err = PTR_ERR(upper_mnt);
	if (IS_ERR(upper_mnt)) {
		pr_err("failed to clone upperpath\n");
		goto out;
	}

	/* Don't inherit atime flags */
	upper_mnt->mnt_flags &= ~(MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME);

See the problem?  upper_mnt has been returned by clone_private_mount()
and at this point there are no references to it anywhere other than
this local variable.  It is not a part of any mount tree, it is not
hashed, it is not reachable via the propagation graph, it is not on
any expiry lists, etc.  Except that it *is* on ->s_mounts of the
underlying superblock (linked via ->mnt_instance), and should anybody
try to call sb_prepare_remount_readonly(), we may end up fucking
things up in all kinds of interesting ways - e.g.

	remount read-only side			ovl_get_upper() side
	mnt_hold_writers() sets MNT_WRITE_HOLD
						we fetch ->mnt_flags
	mnt_unhold_writers() clears MNT_WRITE_HOLD
						we remove the atime bits and
						store ->mnt_flags, bringing
						MNT_WRITE_HOLD back

Makes for a very unhappy mnt_get_write_access() down the road...

The races are real.  Some of them can be dealt with by grabbing
mount_lock around the problematic modifications of ->mnt_flags, but
that's not an option outside of fs/namespace.c - I really don't want
to export mount_lock *or* provide a "clear these flags and set these
ones" exported primitive.  In any case, the underlying problem is that
we have this state of struct mount where it's almost, but not
entirely, thread-local.  That's likely to cause future bugs of the
same sort.

I'd been playing with delaying the insertion into ->s_mounts and
making mnt_get_write_access() fail and complain if the mount hasn't
been inserted yet.  It's... doable, but not pleasant.

However, there's another approach that might be better.  We have
->mnt_instance, which is used only under mount_lock.  We have two
places where we insert, one place where we remove and two places (both
in sb_prepare_remount_readonly()) where we iterate through the list.
What if we steal the LSB of ->mnt_instance.prev for what
MNT_WRITE_HOLD is currently used for?
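In other words, plain pointer tagging on the back link.  Ignoring
barriers and locking (in the real thing all of this happens under
mount_lock, with the smp_* barriers shown in the diff further down),
the hold/unhold side boils down to something like the userspace
mock-up below - all names in it (mount_mock, hold_writers() and
friends) are made up for illustration, the actual change is the patch
at the end:

	/*
	 * Userspace mock-up, not kernel code: stand-in list_head and mount,
	 * plus the tag/untag/test helpers that the proposal open-codes.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct list_head { struct list_head *next, *prev; };

	struct mount_mock {
		struct list_head mnt_instance;	/* linked into sb->s_mounts */
	};

	/* raw value of the prev word; LSB set == "writers are being held" */
	static unsigned long writers_held_word(const struct mount_mock *m)
	{
		return (unsigned long)m->mnt_instance.prev;
	}

	/* what mnt_hold_writers() would do: tag the prev pointer */
	static void hold_writers(struct mount_mock *m)
	{
		m->mnt_instance.prev = (void *)(writers_held_word(m) | 1);
	}

	/* what mnt_unhold_writers() would do: strip the tag */
	static void unhold_writers(struct mount_mock *m)
	{
		m->mnt_instance.prev = (void *)(writers_held_word(m) & ~1UL);
	}

	/* what the spin in mnt_get_write_access() would test */
	static bool write_access_blocked(const struct mount_mock *m)
	{
		return writers_held_word(m) & 1;
	}

	int main(void)
	{
		struct list_head head = { &head, &head };
		struct mount_mock m;

		/* pretend the mount sits alone on ->s_mounts */
		m.mnt_instance.next = &head;
		m.mnt_instance.prev = &head;
		head.next = head.prev = &m.mnt_instance;

		hold_writers(&m);
		printf("blocked: %d\n", write_access_blocked(&m));	/* 1 */
		unhold_writers(&m);
		printf("blocked: %d\n", write_access_blocked(&m));	/* 0 */
		/* prev is back to its untagged value - linkage intact */
		printf("intact: %d\n", m.mnt_instance.prev == &head);	/* 1 */
		return 0;
	}

The only thing that relies on is struct list_head being at least
2-byte aligned, so the LSB of a pointer to one is always free.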
list_for_each_entry() won't give a damn (there's a throwaway
demonstration of that after the diff); list_del() and list_add_tail()
are called in mount_lock scopes that do not contain any calls of
mnt_hold_writers(), so they'll see that LSB clear and work as usual.
The loop in mnt_get_write_access() could just as easily do

	while ((unsigned long)READ_ONCE(mnt->mnt_instance.prev) & 1) {

as the current

	while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) {

and no barriers need to be changed.  We might want to move
->mnt_instance closer to ->mnt or to ->mnt_pcp, for the sake of the
number of cachelines we are accessing, but then ->mnt_instance is
pretty close to ->mnt_pcp as it is.

Something like the delta below, modulo stale comments, etc.  AFAICS,
that removes this "slightly exposed" state, along with the races
caused by it.  This is on top of mainline and it should be easy to
backport.  Comments?

diff --git a/fs/namespace.c b/fs/namespace.c
index 54c59e091919..7e3145baac8f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -506,7 +506,7 @@ int mnt_get_write_access(struct vfsmount *m)
 	 */
 	smp_mb();
 	might_lock(&mount_lock.lock);
-	while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) {
+	while ((unsigned long)READ_ONCE(mnt->mnt_instance.prev) & 1) {
 		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
 			cpu_relax();
 		} else {
 			preempt_enable();
-			lock_mount_hash();
-			unlock_mount_hash();
+			read_seqlock_excl(&mount_lock);
+			read_sequnlock_excl(&mount_lock);
 			preempt_disable();
 		}
 	}
@@ -650,6 +650,11 @@ void mnt_drop_write_file(struct file *file)
 }
 EXPORT_SYMBOL(mnt_drop_write_file);
 
+static inline unsigned long __mnt_writers_held(const struct mount *m)
+{
+	return (unsigned long)m->mnt_instance.prev;
+}
+
 /**
  * mnt_hold_writers - prevent write access to the given mount
  * @mnt: mnt to prevent write access to
@@ -670,7 +675,7 @@ EXPORT_SYMBOL(mnt_drop_write_file);
  */
 static inline int mnt_hold_writers(struct mount *mnt)
 {
-	mnt->mnt.mnt_flags |= MNT_WRITE_HOLD;
+	mnt->mnt_instance.prev = (void *)(__mnt_writers_held(mnt) | 1);
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
 	 * should be visible before we do.
@@ -718,7 +723,7 @@ static inline void mnt_unhold_writers(struct mount *mnt)
 	 * that become unheld will see MNT_READONLY.
 	 */
 	smp_wmb();
-	mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+	mnt->mnt_instance.prev = (void *)(__mnt_writers_held(mnt) & ~1);
 }
 
 static int mnt_make_readonly(struct mount *mnt)
@@ -755,8 +760,9 @@ int sb_prepare_remount_readonly(struct super_block *sb)
 	if (!err)
 		sb_start_ro_state_change(sb);
 	list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
-		if (mnt->mnt.mnt_flags & MNT_WRITE_HOLD)
-			mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
+		unsigned long v = __mnt_writers_held(mnt);
+		if (v & 1)
+			mnt->mnt_instance.prev = (void *)(v & ~1);
 	}
 	unlock_mount_hash();
 
@@ -1349,7 +1355,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	}
 
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
-	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
+	mnt->mnt.mnt_flags &= ~(MNT_MARKED|MNT_INTERNAL);
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_idmap = mnt_idmap_get(mnt_idmap(&old->mnt));
@@ -4959,7 +4965,7 @@ static int mount_setattr_prepare(struct mount_kattr *kattr, struct mount *mnt)
 	 */
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		/* If we had to hold writers unblock them. */
-		if (p->mnt.mnt_flags & MNT_WRITE_HOLD)
+		if (__mnt_writers_held(p) & 1)
 			mnt_unhold_writers(p);
 
 		/*
@@ -4999,7 +5005,7 @@ static void mount_setattr_commit(struct mount_kattr *kattr, struct mount *mnt)
 		WRITE_ONCE(m->mnt.mnt_flags, flags);
 
 		/* If we had to hold writers unblock them. */
-		if (m->mnt.mnt_flags & MNT_WRITE_HOLD)
+		if (__mnt_writers_held(m) & 1)
 			mnt_unhold_writers(m);
 
 		if (kattr->propagation)
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 1a508beba446..1728bc50d02b 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -33,7 +33,6 @@ enum mount_flags {
 	MNT_NOSYMFOLLOW	= 0x80,
 
 	MNT_SHRINKABLE	= 0x100,
-	MNT_WRITE_HOLD	= 0x200,
 
 	MNT_SHARED	= 0x1000, /* if the vfsmount is a shared mount */
 	MNT_UNBINDABLE	= 0x2000, /* if the vfsmount is a unbindable mount */
@@ -64,7 +63,7 @@ enum mount_flags {
 			  | MNT_READONLY | MNT_NOSYMFOLLOW,
 	MNT_ATIME_MASK = MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME,
 
-	MNT_INTERNAL_FLAGS = MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL |
+	MNT_INTERNAL_FLAGS = MNT_SHARED | MNT_INTERNAL |
 			     MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED |
 			     MNT_LOCKED,
 };
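PS: the promised demonstration of the "list_for_each_entry() won't
give a damn" part - forward iteration only ever follows ->next, so a
tag bit in ->prev is invisible to it.  Throwaway userspace mock-up
(hand-rolled list, made-up names, obviously not kernel code):

	#include <stddef.h>
	#include <stdio.h>

	struct list_head { struct list_head *next, *prev; };

	struct mount_mock {
		int id;
		struct list_head mnt_instance;	/* stand-in for ->s_mounts linkage */
	};

	/* same trick as list_entry()/container_of() */
	#define mock_entry(ptr) \
		((struct mount_mock *)((char *)(ptr) - \
			offsetof(struct mount_mock, mnt_instance)))

	int main(void)
	{
		struct list_head s_mounts = { &s_mounts, &s_mounts };
		struct mount_mock m[3];
		struct list_head *pos;
		int i;

		/* open-coded list_add_tail() for three mock mounts */
		for (i = 0; i < 3; i++) {
			m[i].id = i;
			m[i].mnt_instance.prev = s_mounts.prev;
			m[i].mnt_instance.next = &s_mounts;
			s_mounts.prev->next = &m[i].mnt_instance;
			s_mounts.prev = &m[i].mnt_instance;
		}

		/* pretend mnt_hold_writers() tagged the middle one */
		m[1].mnt_instance.prev =
			(void *)((unsigned long)m[1].mnt_instance.prev | 1);

		/* forward walk, the way list_for_each_entry() does it:
		 * ->prev is never read */
		for (pos = s_mounts.next; pos != &s_mounts; pos = pos->next)
			printf("saw mount %d\n", mock_entry(pos)->id);

		return 0;
	}

Prints all three mounts, tagged or not.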