On Fri, May 2, 2025 at 8:01 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > > On Fri, May 2, 2025 at 7:35 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > > > > On Fri, May 2, 2025 at 2:34 PM Jan Kara <jack@xxxxxxx> wrote: > > > > > > On Wed 30-04-25 22:50:23, Mateusz Guzik wrote: > > > > Before I explain why the system call and how, I'm noting a significant > > > > limitation upfront: in my proposal the system call is allowed to fail > > > > with EAGAIN. It's not inherent, but I think it's the sane thing to do. > > > > Why I think that's sensible and why it does not defeat the point is > > > > explained later. > > > > > > > > Why the system call: realpath(3) is issued a lot for example by gcc > > > > (mostly for header files). libc implements it as a series of > > > > readlinks(!) and it unsurprisingly looks atrocious: > > > > [pid 1096382] readlink("/usr", 0x7fffbac84f90, 1023) = -1 EINVAL > > > > (Invalid argument) > > > > [pid 1096382] readlink("/usr/local", 0x7fffbac84f90, 1023) = -1 EINVAL > > > > (Invalid argument) > > > > [pid 1096382] readlink("/usr/local/include", 0x7fffbac84f90, 1023) = > > > > -1 EINVAL (Invalid argument) > > > > [pid 1096382] readlink("/usr/local/include/bits", 0x7fffbac84f90, > > > > 1023) = -1 ENOENT (No such file or directory) > > > > [pid 1096382] readlink("/usr", 0x7fffbac84f90, 1023) = -1 EINVAL > > > > (Invalid argument) > > > > [pid 1096382] readlink("/usr/include", 0x7fffbac84f90, 1023) = -1 > > > > EINVAL (Invalid argument) > > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu", > > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument) > > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits", > > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument) > > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits/types", > > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument) > > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits/types/FILE.h", > > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument) > > > > > > > > and so on. This converts one path lookup to N (by path component). Not > > > > only that's terrible single-threaded, you may also notice all these > > > > lookups bounce lockref-containing cachelines for every path component > > > > in face of gccs running at the same time (and highly parallel > > > > compilations are not rare, are they). > > > > > > > > One way to approach this is to construct the new path on the fly. The > > > > problem with that is that it would require some rototoiling and more > > > > importantly is highly error prone (notably due to symlinks). This is > > > > the bit I'm trying to avoid. > > > > > > > > A very pleasant way out is to instead walk the path forward, then > > > > backward on the found dentry et voila -- all the complexity is handled > > > > for you. There is however a catch: no forward progress guarantee. > > > > > > So AFAIU what you describe here is doing a path lookup and then calling > > > d_path() on the result - actually prepend_path() as I'm glancing in your > > > POC code. > > > > > > > Ye that's the gist. > > > > > > rename seqlock is needed to guarantee correctness, otherwise if > > > > someone renamed a dir as you were resolving the path forward, by the > > > > time you walk it backwards you may get a path which would not be > > > > accessible to you -- a result which is not possible with userspace > > > > realpath. > > > > > > In presence of filesystem mutations paths are always unreliable, aren't > > > they? I mean even with userspace realpath() implementation the moment the > > > function call is returning the path the filesystem can be modified so that > > > the path stops being valid. With kernel it is the same. So I don't see any > > > strong reason to bother with handling parallel filesystem modifications. > > > But maybe I'm missing some practically important case... > > > > > > > The concern is not that the result is stale, but that it was not > > legitimately obtainable at any point by the caller doing the current > > realpath walk. > > > > Consider the following tree: > > /foo/file > > /bar > > > > where foo is 755, bar is 700 and both are owned by root, while the > > user issuing realpath has some other uid > > > > if root renames /foo/file to /bar/file while racing against realpath > > /foo/file, there is a time window where the user will find the dentry > > and by the time they d_path the result is /bar/file. but they never > > would get /bar/file with the current implementation. > > > > That said, if this is considered fine, the entire thing turns a > trivial patch for sure. > To elaborate, the result is not obtainable in two ways, one of which has a security angle to it. Let's grab the tree again: /foo/file /bar except both foo and bar are 755 with userspace realpath the following results are possible when racing against rename /foo/file /bar/file: - success, the path is returned as /foo/file - ENOENT with kernel realpath not checking for rename ENOENT is off the table, instead you get: - success, the path is returned as /bar/file Suppose that's fine and maybe even considered an improvement. Now consider a case where bar is 700 and the user issuing realpath can't traverse on it. The user would *not* get the path even if they realpath /bar/file as they don't have perms. You could argue the user does not have the rights to know where the file after said rename. So I stand by the need to check rename seq. However, I wonder if it would be tolerable to redo the lookup with the *new* path and check if we got the same dentry at the end. If so, we know the path was obtainable by traversal. -- Mateusz Guzik <mjguzik gmail.com>