On Tue, Sep 2, 2025 at 10:33 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Mon, Sep 01, 2025 at 08:39:27PM +0200, Mateusz Guzik wrote: > > On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote: > > > Dear fs hackers, > > > > > > I suspect there's an unfortunate race window in __fput where file locks are > > > dropped (locks_remove_file) prior to decreasing writer refcount > > > (put_file_access). If I'm not mistaken, this window is observable and it > > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained > > > in more detail below. > > > > > > The program demonstrating the problem is attached (a slightly modified version > > > of the demo given by Russ Cox on the Go issue tracker, see URL in first line). > > > It makes 20 threads, each executing an infinite loop doing the following: > > > > > > 1) open an fd for writing with O_CLOEXEC > > > 2) write executable code into it > > > 3) close it > > > 4) fork > > > 5) in the child, attempt to execve the just-written file > > > > > > If you compile it with -DNOWAIT, you'll see that execve often fails with > > > ETXTBSY. > > > > This problem was reported a few times and is quite ancient by now. > > > > While acknowleding the resulting behavior needs to be fixed, I find the > > proposed solutions are merely trying to put more lipstick or a wig on a > > pig. > > > > The age of the problem suggests it is not *urgent* to fix it. > > > > The O_CLOFORM idea was accepted into POSIX and recent-ish implemented in > > all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also > > a part of pig's attire so I think that's the right call. > > > > Not denying execs of files open for writing had to get reverted as > > apparently some software depends on it, so that's a no-go either. > > > > The flag proposed by Christian elsewhere in the thread would sort this > > out, but it's just another hack which would serve no purpose if the > > issue stopped showing up. > > > > The real problem is fork()+execve() combo being crap syscalls with crap > > semantics, perpetuating the unix tradition of screwing you over unless > > you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc > > does not hang out with surprise fds). > > > > While I don't have anything fleshed out nor have any interest in putting > > any work in the area, I would suggest anyone looking to solve the ETXTBSY > > went after the real culprit instead of damage-controlling the current > > API. > > > > To that end, my sketch of a suggestion boils down to a new API which > > allows you to construct a new process one step at a time explicitly > > spelling out resources which are going to get passed on, finally doing > > an actual exec. You would start with getting a file descriptor to a new > > task_struct which you gradually populate and eventually exec something > > on. There would be no forking. > > > > It could look like this (ignore specific naming): > > > > /* get a file descriptor for the new process. there is no *fork* here, > > * but task_struct & related get allocated > > * clean slate, no sigmask bullshit and similar > > */ > > pfd = proc_new(); > > > > nullfd = open("/dev/null", O_RDONLY); > > > > /* map /dev/null as 0/1/2 in the new proc */ > > proc_install_fd(pfd, nullfd, 0); > > proc_install_fd(pfd, nullfd, 2); > > proc_install_fd(pfd, nullfd, 2); > > > > /* if we can run the proc as someone else, set it up here */ > > proc_install_cred(pfd, uid, gid, groups, ...); > > > > proc_set_umask(pfd, ...); > > > > /* finally exec */ > > proc_exec_by_path("/bin/sh", argp, envp); > > You can trivially build this API on top of pidfs. Like: > > pidfd_empty = pidfd_open(FD_PIDFS_ROOT/FD_INVALID, PIDFD_EMPTY) > > where FD_PIDFS_ROOT and FD_INVALID are things we already have in the > uapi headers. > > Then either just add a new system call like pidfd_config() or just use > ioctls() on that empty pidfd. > > With pidfs you have the complete freedom to implement that api however > you want. > > I had a prototype for that as well but I can't find it anymore. Other > VFS work took priority and so I never finished it but I remember it > wasn't very difficult. > > I would definitely merge a patch series like that. > I'm happy we are on the same page here, I'm unhappy it is down to the point of neither of us doing the work though. :-P I don't think the work is difficult from tech standpoint, but it does warrant some caution. With the current model fork + exec model, whatever process-specific bullshit gets added, you automatically get it on fork. Something which gradually spells out the state will need a way to figure out what was not sorted out and fill it up in kernel-side. There maybe some other caveats, basically it would be good if someone(tm) really thought this through. It would be really embarrassing to end up with a deficient & unfixable API which has to remain supported. :-P I offer half of the kingdom for a finished product, the interested party will have to arrange a bride from elsewhere. > > > > Notice how not once at any point random-ass file descriptors popped into > > the new task, which has a side effect of completely avoiding the > > problem. > > > > you may also notice this should be faster to execute as it does not have > > to pay the mm overhead. > > > > While proc_install_fd is spelled out as singular syscalls, this can be > > batched to accept an array of <from, to> pairs etc. > > > > Also notice the thread executing it is not shackled by any of vfork > > limitations. > > > > So... if someone is serious about the transient ETXTBSY, I would really > > hope you will consider solving the source of the problem, even if you > > come up with someting other than I did (hopefully better). It would be a > > damn shame to add even more hacks to pacify this problem (like the O_ > > stuff). > > > > What to do in the meantime? There is a lol hack you can do in userspace > > which so ugly I'm not even going to spell it out, but given the > > temporary nature of ETXTBSY I'm sure you can guess what it is. > > > > Something to ponder, cheers. -- Mateusz Guzik <mjguzik gmail.com>