Re: [PATCH bpf-next 3/4] bpf: Introduce path iterator

Mickaël Salaün <mic@xxxxxxxxxxx> · Fri, 30 May 2025 14:20:39 +0200

On Thu, May 29, 2025 at 05:42:16PM -0700, Song Liu wrote:
> On Thu, May 29, 2025 at 4:10 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Thu, May 29, 2025 at 03:13:10PM -0700, Song Liu wrote:
> >
> > > Is it an issue if we only hold a reference to a MNT_LOCKED mount for
> > > short period of time? "Short period" means it may get interrupted, page
> > > faults, or wait for an IO (read xattr), but it won't hold a reference to the
> > > mount and sleep indefinitely.
> >
> > MNT_LOCKED mount itself is not a problem.  What shouldn't be done is
> > looking around in the mountpoint it covers.  It depends upon the things
> > you are going to do with that, but it's very easy to get an infoleak
> > that way.
> >
> > > > OTOH, there's a good cause for moving some of the flags, MNT_LOCKED
> > > > included, out of ->mnt_flags and into a separate field in struct mount.
> > > > However, that would conflict with any code using that to deal with
> > > > your iterator safely.
> > > >
> > > > What's more, AFAICS in case of a stack of mounts each covering the root
> > > > of parent mount, you stop in each of those.  The trouble is, umount(2)
> > > > propagation logics assumes that intermediate mounts can be pulled out of
> > > > such stack without causing trouble.  For pathname resolution that is
> > > > true; it goes through the entire stack atomically wrt that stuff.
> > > > For your API that's not the case; somebody who has no idea about an
> > > > intermediate mount being there might get caught on it while it's getting
> > > > pulled from the stack.
> > > >
> > > > What exactly do you need around the mountpoint crossing?
> > >
> > > I thought about skipping intermediate mounts (that are hidden by
> > > other mounts). AFAICT, not skipping them will not cause any issue.
> >
> > It can.  Suppose e.g. that /mnt gets propagation from another namespace,
> > but not the other way round and you mount something on /mnt.
> >
> > Later, in that another namespace, somebody mounts something on wherever
> > your /mnt gets propagation to.  A copy will be propagated _between_
> > your /mnt and whatever you've mounted on top of it; it will be entirely
> > invisible until you umount your /mnt.  At that point the propagated
> > copy will show up there, same as if it had appeared just after your
> > umount.  Prior to that it's entirely invisible.  If its original
> > counterpart in another namespace gets unmounted first, the copy will
> > be quietly pulled out.
> 
> Thanks for sharing this information!
> 
> > Note that choose_mountpoint_rcu() callers (including choose_mountpoint())
> > will have mount_lock seqcount sampled before the traversal _and_ recheck
> > it after having reached the bottom of stack.  IOW, if you traverse ..
> > on the way to root, you won't get caught on the sucker being pulled out.
> 
> In some of our internal discussions, we talked about using
> choose_mountpoint() instead of follow_up(). I didn't go that direction in this
> version because it requires holding "root". But if it makes more sense
> to use, choose_mountpoint(), we sure can hold "root".
> 
> Alternatively, I think it is also OK to pass a zero'ed root to
> choose_mountpoint().
> 
> > Your iterator, OTOH, would stop in that intermediate mount - and get
> > an unpleasant surprise when it comes back to do the next step (towards
> > /mnt on root filesystem, that is) and finds that path->mnt points
> > to something that is detached from everything - no way to get from
> > it any further.  That - despite the fact that location you've started
> > from is still mounted, still has the same pathname, etc. and nothing
> > had been disrupted for it.
> >
> > And yes, landlock has a narrow race in the matching place.  Needs to
> > be fixed.  At least it does ignore those as far as any decisions are
> > concerned...

Thanks for pointing this out.  In the case of Landlock, walking to a
disconnected mount point (because of this umount race condition) would
deny the requested access whereas it may be allowed otherwise.  This is
not a security issue but still an issue because an event unrelated to
the request (umount) can abort a path resolution, which should not be
the case.

Without access to mount_lock, what would be the best way to fix this
Landlock issue while making it backportable?

> 
> If we update path_parent in this patchset with choose_mountpoint(),
> and use it in Landlock, we will close this race condition, right?

choose_mountpoint() is currently private, but if we add a new filesystem
helper, I think the right approach would be to expose follow_dotdot(),
updating its arguments with public types.  This way the intermediates
mount points will not be exposed, RCU optimization will be leveraged,
and usage of this new helper will be simplified.

> 
> >
> > Note, BTW, that it might be better off by doing that similar to
> > d_path.c - without arseloads of dget_parent/dput et.al.; not sure
> > how feasible it is, but if everything in it can be done under
> > rcu_read_lock(), that's something to look into.
> 
> I don't think we can do everything here inside rcu_read_lock().
> But d_path.c does have some code we can probably reuse or
> learn from. Also, we probably need two variations of iterators,
> one walk until absolute root, while the other walk until root of
> current->fs, just like d_path() vs. d_absolute_path(). Does this
> sound reasonable?

Passing the root to a public follow_dotdot() helper should do the job.

> 
> Thanks,
> Song
>