[RFF] realpathat system call

Before I explain why I want this system call and how it would work,
one significant limitation upfront: in my proposal the system call is
allowed to fail with EAGAIN. That is not inherent to the design, but I
think it's the sane thing to do. Why I think that's sensible and why it
does not defeat the point is explained later.

Why the system call: realpath(3) is issued a lot, for example by gcc
(mostly for header files). libc implements it as a series of
readlinks(!), and the result unsurprisingly looks atrocious:
[pid 1096382] readlink("/usr", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
[pid 1096382] readlink("/usr/local", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
[pid 1096382] readlink("/usr/local/include", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
[pid 1096382] readlink("/usr/local/include/bits", 0x7fffbac84f90, 1023) = -1 ENOENT (No such file or directory)
[pid 1096382] readlink("/usr", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
[pid 1096382] readlink("/usr/include", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
[pid 1096382] readlink("/usr/include/x86_64-linux-gnu", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
[pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
[pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits/types", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)
[pid 1096382] readlink("/usr/include/x86_64-linux-gnu/bits/types/FILE.h", 0x7fffbac84f90, 1023) = -1 EINVAL (Invalid argument)

and so on. This converts one path lookup into N lookups (one per path
component). Not only is that terrible single-threaded, you may also
notice that all these lookups bounce lockref-containing cachelines for
every path component when several gccs are running at the same time
(and highly parallel compilations are not rare, are they).
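
To make the cost concrete, here is a minimal userspace sketch of the
per-component probing visible in the trace. It is not libc's actual
realpath(3) code: count_readlink_probes() is a made-up helper that
issues one readlink(2) per path prefix and counts the calls, and it
does not follow symlinks the way a real implementation must.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* readlink(2), PATH_MAX under strict -std= */
#endif
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustration only (hypothetical helper): probe every prefix of an
 * absolute path with readlink(2), as in the strace output, and return
 * the number of calls issued. Symlink targets are NOT followed here.
 */
static int count_readlink_probes(const char *abs_path)
{
	char prefix[PATH_MAX];
	char target[PATH_MAX];
	size_t len = strlen(abs_path);
	int calls = 0;

	assert(abs_path[0] == '/');
	if (len >= sizeof(prefix))
		return -1;

	for (size_t i = 1; i <= len; i++) {
		/* A component ends at a '/' or at the end of the string. */
		if (abs_path[i] != '/' && abs_path[i] != '\0')
			continue;
		/* Skip empty components ("//" or a trailing '/'). */
		if (abs_path[i - 1] == '/')
			continue;

		memcpy(prefix, abs_path, i);
		prefix[i] = '\0';

		/* EINVAL here means "exists, but is not a symlink". */
		(void)readlink(prefix, target, sizeof(target) - 1);
		calls++;
	}
	return calls;
}
```

For the last header in the trace this gives six probes, one per
component, where a single forward lookup would do.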

One way to approach this is to construct the new path on the fly. The
problem with that is that it would require some rototilling and, more
importantly, is highly error prone (notably due to symlinks). This is
the bit I'm trying to avoid.

A very pleasant way out is to instead walk the path forward, then
backward on the found dentry et voila -- all the complexity is handled
for you. There is however a catch: no forward progress guarantee.
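
A rough userspace analogue of "walk forward, then backward" can be had
today on Linux by opening the path and reading the /proc/self/fd/N
link, which the kernel fills in by walking back from the dentry. This
is only a sketch of the idea, not the proposed system call:
realpath_via_fd() is a made-up name, it assumes /proc is mounted, it
costs three system calls, and deleted files show up with a
" (deleted)" suffix.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for O_PATH */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH O_RDONLY		/* fallback if O_PATH is not exposed */
#endif

/*
 * Hypothetical helper: resolve 'path' by opening it (forward walk) and
 * asking the kernel for the name behind the descriptor (backward walk).
 * Returns the length of the result, or -1 with errno set. A result of
 * exactly size - 1 bytes may have been truncated by readlink(2).
 */
static ssize_t realpath_via_fd(const char *path, char *buf, size_t size)
{
	char link[64];
	ssize_t len;
	int saved_errno;
	int fd;

	assert(size > 0);

	fd = open(path, O_PATH | O_CLOEXEC);
	if (fd < 0)
		return -1;

	snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
	len = readlink(link, buf, size - 1);
	if (len >= 0)
		buf[len] = '\0';	/* readlink(2) does not terminate */

	/* Do not let close(2) clobber the errno from readlink(2). */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return len;
}
```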

The rename seqlock is needed to guarantee correctness. Otherwise, if
someone renamed a dir as you were resolving the path forward, by the
time you walk it backwards you may get a path which would not be
accessible to you -- a result which is not possible with userspace
realpath.

Locking rename to stabilize it does not solve the problem as by the
time you retry, the needed dentries may be evicted and you may need to
do I/O which you can't do with that lock held. Once you drop it, you
may end up finding it has changed again and you are back to square
one. In principle this can keep happening indefinitely.

So I think the easiest way out is to in fact allow the routine to just
fail after some number of retries, which eliminates the need for a
forward progress guarantee.

This should be perfectly fine as all userspace already has its own
support for realpath. In the worst case they can just fall back to the
current code, transparently to the consumer.
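
On the libc side the fallback could look roughly like the sketch
below. The names are assumptions for illustration: fast_realpath_fn
stands in for whatever wrapper would invoke the new system call (no
such wrapper or syscall number exists), and fast_always_eagain() is a
stub that only exercises the fallback path. ENOSYS is treated like
EAGAIN so that older kernels keep working.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* realpath(3) under strict -std= */
#endif
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical fast path: returns the result length (including the
 * terminating NUL) on success, or -1 with errno set on failure.
 */
typedef long (*fast_realpath_fn)(const char *path, char *buf, size_t size);

/*
 * Try the fast path first; fall back to realpath(3) when it is missing
 * (ENOSYS) or gave up (EAGAIN). 'resolved' must hold PATH_MAX bytes.
 */
static char *realpath_with_fallback(const char *path, char *resolved,
				    fast_realpath_fn fast)
{
	assert(path != NULL && resolved != NULL);

	if (fast != NULL) {
		long len = fast(path, resolved, PATH_MAX);

		if (len > 0)
			return resolved;
		/* Genuine errors (ENOENT, EACCES, ...) are final. */
		if (errno != EAGAIN && errno != ENOSYS)
			return NULL;
	}
	return realpath(path, resolved);
}

/* Stub standing in for a kernel that keeps losing the rename race. */
static long fast_always_eagain(const char *path, char *buf, size_t size)
{
	(void)path;
	(void)buf;
	(void)size;
	errno = EAGAIN;
	return -1;
}
```

The caller never learns which path was taken, which is the
"transparently to the consumer" part.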

There is a funny bit where the rename check may end up failing a lot;
that still needs to be massaged.

Any comments?

What follows below is an ugly as sin implementation for reference,
*not* an actual thing I would submit:
/*
 * realpathat system call
 *
 * TODO: note cpu waste from redundant seq checks from lookup and prepend_path
 * TODO: note why there is EAGAIN
 * TODO: retry without the lock a bunch of times
 */
SYSCALL_DEFINE5(realpathat, int, dfd, const char __user *, name,
               char __user *, buf, unsigned long, size, int, flags)
{
       struct path path, root;
       struct filename *filename;
       char *page;
       unsigned f_seq, m_seq, r_seq, len;
       int error;

       if (unlikely(flags != 0))
               return -EINVAL;

       page = __getname();
       if (unlikely(!page))
               return -ENOMEM;

       /* error checked in filename_lookup() */
       filename = getname_flags(name, flags);

       f_seq = __read_seqcount_begin(&current->fs->seq);
       m_seq = __read_seqcount_begin(&mount_lock.seqcount);
       r_seq = __read_seqcount_begin(&rename_lock.seqcount);
       smp_rmb();

       /* repeated seq checks inside! */
       error = filename_lookup(dfd, filename, flags, &path, NULL);
       if (error)
               goto out_putname;

       error = -EAGAIN;

       DECLARE_BUFFER(b, page, PATH_MAX);

       rcu_read_lock();
       get_fs_root_rcu(current->fs, &root);
       prepend_char(&b, 0);
       /*
        * XXX what about unhashed entries (d_path?)
        */
       if (unlikely(prepend_path(&path, &root, &b) > 0)) {
               rcu_read_unlock();
               goto out;
       }
       rcu_read_unlock();

       smp_rmb();
       if (__read_seqcount_retry(&rename_lock.seqcount, r_seq) ||
           __read_seqcount_retry(&mount_lock.seqcount, m_seq) ||
           __read_seqcount_retry(&current->fs->seq, f_seq))
               goto out;

       /* copied verbatim from getcwd */
       len = PATH_MAX - b.len;
       if (unlikely(len > PATH_MAX))
               error = -ENAMETOOLONG;
       else if (unlikely(len > size))
               error = -ERANGE;
       else if (copy_to_user(buf, b.buf, len))
               error = -EFAULT;
       else
               error = len;

out:
       path_put(&path);
out_putname:
       putname(filename);
       __putname(page);
       return error;
}

-- 
Mateusz Guzik <mjguzik gmail.com>



