On Fri 02-05-25 20:28:13, Mateusz Guzik wrote: > On Fri, May 2, 2025 at 8:01 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > > > > On Fri, May 2, 2025 at 7:35 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > > > > > > On Fri, May 2, 2025 at 2:34 PM Jan Kara <jack@xxxxxxx> wrote: > > > > > > > > On Wed 30-04-25 22:50:23, Mateusz Guzik wrote: > > > > > Before I explain why the system call and how, I'm noting a significant > > > > > limitation upfront: in my proposal the system call is allowed to fail > > > > > with EAGAIN. It's not inherent, but I think it's the sane thing to do. > > > > > Why I think that's sensible and why it does not defeat the point is > > > > > explained later. > > > > > > > > > > Why the system call: realpath(3) is issued a lot for example by gcc > > > > > (mostly for header files). libc implements it as a series of > > > > > readlinks(!) and it unsurprisingly looks atrocious: > > > > > [pid 1096382] readlink("/usr", 0x7fffbac84f90, 1023) = -1 EINVAL > > > > > (Invalid argument) > > > > > [pid 1096382] readlink("/usr/local", 0x7fffbac84f90, 1023) = -1 EINVAL > > > > > (Invalid argument) > > > > > [pid 1096382] readlink("/usr/local/include", 0x7fffbac84f90, 1023) = > > > > > -1 EINVAL (Invalid argument) > > > > > [pid 1096382] readlink("/usr/local/include/bits", 0x7fffbac84f90, > > > > > 1023) = -1 ENOENT (No such file or directory) > > > > > [pid 1096382] readlink("/usr", 0x7fffbac84f90, 1023) = -1 EINVAL > > > > > (Invalid argument) > > > > > [pid 1096382] readlink("/usr/include", 0x7fffbac84f90, 1023) = -1 > > > > > EINVAL (Invalid argument) > > > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu", > > > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument) > > > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits", > > > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument) > > > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits/types", > > > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument) > > > > > [pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits/types/FILE.h", > > > > > 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument) > > > > > > > > > > and so on. This converts one path lookup to N (by path component). Not > > > > > only that's terrible single-threaded, you may also notice all these > > > > > lookups bounce lockref-containing cachelines for every path component > > > > > in face of gccs running at the same time (and highly parallel > > > > > compilations are not rare, are they). > > > > > > > > > > One way to approach this is to construct the new path on the fly. The > > > > > problem with that is that it would require some rototoiling and more > > > > > importantly is highly error prone (notably due to symlinks). This is > > > > > the bit I'm trying to avoid. > > > > > > > > > > A very pleasant way out is to instead walk the path forward, then > > > > > backward on the found dentry et voila -- all the complexity is handled > > > > > for you. There is however a catch: no forward progress guarantee. > > > > > > > > So AFAIU what you describe here is doing a path lookup and then calling > > > > d_path() on the result - actually prepend_path() as I'm glancing in your > > > > POC code. > > > > > > > > > > Ye that's the gist. > > > > > > > > rename seqlock is needed to guarantee correctness, otherwise if > > > > > someone renamed a dir as you were resolving the path forward, by the > > > > > time you walk it backwards you may get a path which would not be > > > > > accessible to you -- a result which is not possible with userspace > > > > > realpath. > > > > > > > > In presence of filesystem mutations paths are always unreliable, aren't > > > > they? I mean even with userspace realpath() implementation the moment the > > > > function call is returning the path the filesystem can be modified so that > > > > the path stops being valid. With kernel it is the same. So I don't see any > > > > strong reason to bother with handling parallel filesystem modifications. > > > > But maybe I'm missing some practically important case... > > > > > > > > > > The concern is not that the result is stale, but that it was not > > > legitimately obtainable at any point by the caller doing the current > > > realpath walk. > > > > > > Consider the following tree: > > > /foo/file > > > /bar > > > > > > where foo is 755, bar is 700 and both are owned by root, while the > > > user issuing realpath has some other uid > > > > > > if root renames /foo/file to /bar/file while racing against realpath > > > /foo/file, there is a time window where the user will find the dentry > > > and by the time they d_path the result is /bar/file. but they never > > > would get /bar/file with the current implementation. > > > > > > > That said, if this is considered fine, the entire thing turns a > > trivial patch for sure. > > > > To elaborate, the result is not obtainable in two ways, one of which > has a security angle to it. > > Let's grab the tree again: > /foo/file > /bar > > except both foo and bar are 755 > > with userspace realpath the following results are possible when racing > against rename /foo/file /bar/file: > - success, the path is returned as /foo/file > - ENOENT > > with kernel realpath not checking for rename ENOENT is off the table, > instead you get: > - success, the path is returned as /bar/file > > Suppose that's fine and maybe even considered an improvement. > > Now consider a case where bar is 700 and the user issuing realpath > can't traverse on it. > > The user would *not* get the path even if they realpath /bar/file as > they don't have perms. You could argue the user does not have the > rights to know where the file after said rename. Indeed. In particular if the directory hierarchy is deeper, the user could learn about things he is never supposed to see. Thanks for explanation. > So I stand by the need to check rename seq. > > However, I wonder if it would be tolerable to redo the lookup with the > *new* path and check if we got the same dentry at the end. If so, we > know the path was obtainable by traversal. I agree that if following path lookup will succeed, it should be safe to give out the path. But there's one more corner case - suppose you look up /foo/file while racing with rename /bar/file /foo/file. With userspace implementation you're always going to get /foo/file as rename is atomic wrt other dir operations. With the kernel implementation, you may call prepend_path() on already deleted dentry with obvious results... Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR