On Tue, Jun 10, 2025 at 09:21:33AM +0100, Al Viro wrote:
> Original rationale for those had been the reduced cost of mntput()
> for the stuff that is mounted somewhere. Mount refcount increments and
> decrements are frequent; what's worse, they tend to concentrate on the
> same instances and cacheline pingpong is quite noticeable.
>
> As the result, mount refcounts are per-cpu; that allows a very cheap
> increment. Plain decrement would be just as easy, but decrement-and-test
> is anything but (we need to add the components up, with exclusion against
> possible increment-from-zero, etc.).
>
> Fortunately, there is a very common case where we can tell that decrement
> won't be the final one - if the thing we are dropping is currently
> mounted somewhere. We have an RCU delay between the removal from mount
> tree and dropping the reference that used to pin it there, so we can
> just take rcu_read_lock() and check if the victim is mounted somewhere.
> If it is, we can go ahead and decrement without any further checks -
> the reference we are dropping is not the last one. If it isn't, we
> get all the fun with locking, carefully adding up components, etc.,
> but the majority of refcount decrements end up taking the fast path.
>
> There is a major exception, though - pipes and sockets. Those live
> on the internal filesystems that are not going to be mounted anywhere.
> They are not going to be _un_mounted, of course, so having to take the
> slow path every time a pipe or socket gets closed is really obnoxious.
> Solution had been to mark them as long-lived ones - essentially faking
> "they are mounted somewhere" indicator.
>
> With minor modification that works even for ones that do eventually get
> dropped - all it takes is making sure we have an RCU delay between
> clearing the "mounted somewhere" indicator and dropping the reference.
>
> There are some additional twists (if you want to drop a dozen of such
> internal mounts, you'd be better off with clearing the indicator on
> all of them, doing an RCU delay once, then dropping the references),
> but in the basic form it had been
>	* use kern_mount() if you want your internal mount to be
>	  a long-term one.
>	* use kern_unmount() to undo that.
>
> Unfortunately, the things did rot a bit during the mount API reshuffling.
> In several cases we have lost the "fake the indicator" part; kern_unmount()
> on the unmount side remained (it doesn't warn if you use it on a mount
> without the indicator), but all benefits regarding mntput() cost had been
> lost.
>
> To get rid of that bitrot, let's add a new helper that would work
> with fs_context-based API: fc_mount_longterm(). It's a counterpart
> of fc_mount() that does, on success, mark its result as long-term.
> It must be paired with kern_unmount() or equivalents.
>
> Converted:
>	1) mqueue (it used to use kern_mount_data() and the umount side
>	   is still as it used to be)
>	2) hugetlbfs (used to use kern_mount_data(), internal mount is
>	   never unmounted in this one)
>	3) i915 gemfs (used to be kern_mount() + manual remount to set
>	   options, still uses kern_unmount() on umount side)
>	4) v3d gemfs (copied from i915)
>
> Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> ---

Reviewed-by: Christian Brauner <brauner@xxxxxxxxxx>
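
For illustration only, the fast path/slow path split described in the
quoted text can be modelled in plain userspace C. This is a rough,
single-threaded sketch and not the kernel implementation: struct
model_mount, the model_*() helpers and NR_CPUS are all hypothetical
names, the "pinned" flag stands in for the "mounted somewhere"
indicator, and neither the locking nor the RCU delay is actually
modelled (they only appear as comments).

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical CPU count for the model */

/* Hypothetical stand-in for a mount; NOT the kernel's struct mount. */
struct model_mount {
	long count[NR_CPUS];	/* per-CPU components; only their sum is meaningful */
	bool pinned;		/* "mounted somewhere" indicator, real or faked */
	bool freed;		/* set once the last reference has been dropped */
	unsigned long slow_puts;	/* number of puts that took the slow path */
};

/* Cheap increment: touches only the local CPU's component. */
static void model_get(struct model_mount *m, int cpu)
{
	m->count[cpu]++;
}

/* Add the components up; the kernel needs exclusion around this. */
static long model_sum(const struct model_mount *m)
{
	long sum = 0;

	for (int i = 0; i < NR_CPUS; i++)
		sum += m->count[i];
	return sum;
}

static void model_put(struct model_mount *m, int cpu)
{
	long sum;

	if (m->pinned) {
		/* Fast path: something still pins the mount, so this
		 * cannot be the final drop; a plain decrement is enough. */
		m->count[cpu]--;
		return;
	}
	/* Slow path: the kernel would take a lock here and exclude
	 * increment-from-zero before summing the components. */
	m->slow_puts++;
	m->count[cpu]--;
	sum = model_sum(m);
	assert(sum >= 0);
	if (sum == 0)
		m->freed = true;
}

/* Model of creating a long-term internal mount: fake the indicator. */
static void model_mount_longterm(struct model_mount *m)
{
	*m = (struct model_mount){0};
	m->count[0] = 1;	/* reference held by the creator */
	m->pinned = true;
}

/* Model of the unmount side: clear the indicator, then drop. */
static void model_unmount(struct model_mount *m)
{
	m->pinned = false;
	/* The kernel needs an RCU delay at this point, so that every
	 * put that saw the indicator set has finished. The model is
	 * single-threaded, so nothing can be in flight. */
	model_put(m, 0);
}
```

In this model a get on one CPU followed by a put on another leaves a
negative component on the second CPU while the sum stays correct, and
no put reaches the slow path until model_unmount() has cleared the
indicator.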