Re: [RFF] realpathat system call

Mateusz Guzik <mjguzik@xxxxxxxxx> · Fri, 2 May 2025 20:28:13 +0200

On Fri, May 2, 2025 at 8:01 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Fri, May 2, 2025 at 7:35 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >
> > On Fri, May 2, 2025 at 2:34 PM Jan Kara <jack@xxxxxxx> wrote:
> > >
> > > On Wed 30-04-25 22:50:23, Mateusz Guzik wrote:
> > > > Before I explain why the system call and how, I'm noting a significant
> > > > limitation upfront: in my proposal the system call is allowed to fail
> > > > with EAGAIN. It's not inherent, but I think it's the sane thing to do.
> > > > Why I think that's sensible and why it does not defeat the point is
> > > > explained later.
> > > >
> > > > Why the system call: realpath(3) is issued a lot for example by gcc
> > > > (mostly for header files). libc implements it as a series of
> > > > readlinks(!) and it unsurprisingly looks atrocious:
> > > > [pid 1096382] readlink("/usr", 0x7fffbac84f90, 1023) = -1 EINVAL
> > > > (Invalid argument)
> > > > [pid 1096382] readlink("/usr/local", 0x7fffbac84f90, 1023) = -1 EINVAL
> > > > (Invalid argument)
> > > > [pid 1096382] readlink("/usr/local/include", 0x7fffbac84f90, 1023) =
> > > > -1 EINVAL (Invalid argument)
> > > > [pid 1096382] readlink("/usr/local/include/bits", 0x7fffbac84f90,
> > > > 1023) = -1 ENOENT (No such file or directory)
> > > > [pid 1096382] readlink("/usr", 0x7fffbac84f90, 1023) = -1 EINVAL
> > > > (Invalid argument)
> > > > [pid 1096382] readlink("/usr/include", 0x7fffbac84f90, 1023) = -1
> > > > EINVAL (Invalid argument)
> > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu",
> > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
> > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits",
> > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
> > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits/types",
> > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
> > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits/types/FILE.h",
> > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
> > > >
> > > > and so on. This converts one path lookup to N (by path component). Not
> > > > only that's terrible single-threaded, you may also notice all these
> > > > lookups bounce lockref-containing cachelines for every path component
> > > > in face of gccs running at the same time (and highly parallel
> > > > compilations are not rare, are they).
> > > >
> > > > One way to approach this is to construct the new path on the fly. The
> > > > problem with that is that it would require some rototoiling and more
> > > > importantly is highly error prone (notably due to symlinks). This is
> > > > the bit I'm trying to avoid.
> > > >
> > > > A very pleasant way out is to instead walk the path forward, then
> > > > backward on the found dentry et voila -- all the complexity is handled
> > > > for you. There is however a catch: no forward progress guarantee.
> > >
> > > So AFAIU what you describe here is doing a path lookup and then calling
> > > d_path() on the result - actually prepend_path() as I'm glancing in your
> > > POC code.
> > >
> >
> > Ye that's the gist.
> >
> > > > rename seqlock is needed to guarantee correctness, otherwise if
> > > > someone renamed a dir as you were resolving the path forward, by the
> > > > time you walk it backwards you may get a path which would not be
> > > > accessible to you -- a result which is not possible with userspace
> > > > realpath.
> > >
> > > In presence of filesystem mutations paths are always unreliable, aren't
> > > they? I mean even with userspace realpath() implementation the moment the
> > > function call is returning the path the filesystem can be modified so that
> > > the path stops being valid. With kernel it is the same. So I don't see any
> > > strong reason to bother with handling parallel filesystem modifications.
> > > But maybe I'm missing some practically important case...
> > >
> >
> > The concern is not that the result is stale, but that it was not
> > legitimately obtainable at any point by the caller doing the current
> > realpath walk.
> >
> > Consider the following tree:
> > /foo/file
> > /bar
> >
> > where foo is 755, bar is 700 and both are owned by root, while the
> > user issuing realpath has some other uid
> >
> > if root renames /foo/file to /bar/file while racing against realpath
> > /foo/file, there is a time window where the user will find the dentry
> > and by the time they d_path the result is /bar/file. but they never
> > would get /bar/file with the current implementation.
> >
>
> That said, if this is considered fine, the entire thing turns a
> trivial patch for sure.
>

To elaborate,  the result is not obtainable in two ways, one of which
has a security angle to it.

Let's grab the tree again:
/foo/file
/bar

except both foo and bar are 755

with userspace realpath the following results are possible when racing
against rename /foo/file /bar/file:
- success, the path is returned as /foo/file
- ENOENT

with kernel realpath not checking for rename ENOENT is off the table,
instead you get:
- success, the path is returned as /bar/file

Suppose that's fine and maybe even considered an improvement.

Now consider a case where bar is 700 and the user issuing realpath
can't traverse on it.

The user would *not* get the path even if they realpath /bar/file as
they don't have perms. You could argue the user does not have the
rights to know where the file after said rename.

So I stand by the need to check rename seq.

However, I wonder if it would be tolerable to redo the lookup with the
*new* path and check if we got the same dentry at the end. If so, we
know the path was obtainable by traversal.
-- 
Mateusz Guzik <mjguzik gmail.com>