Re: ETXTBSY window in __fput

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 2 Sep 2025 10:33:15 +0200

On Mon, Sep 01, 2025 at 08:39:27PM +0200, Mateusz Guzik wrote:
> On Wed, Aug 27, 2025 at 12:05:38AM +0300, Alexander Monakov wrote:
> > Dear fs hackers,
> > 
> > I suspect there's an unfortunate race window in __fput where file locks are
> > dropped (locks_remove_file) prior to decreasing writer refcount
> > (put_file_access). If I'm not mistaken, this window is observable and it
> > breaks a solution to the ETXTBSY problem on exec'ing a just-written file, explained
> > in more detail below.
> > 
> > The program demonstrating the problem is attached (a slightly modified version
> > of the demo given by Russ Cox on the Go issue tracker, see URL in first line).
> > It makes 20 threads, each executing an infinite loop doing the following:
> > 
> > 1) open an fd for writing with O_CLOEXEC
> > 2) write executable code into it
> > 3) close it
> > 4) fork
> > 5) in the child, attempt to execve the just-written file
> > 
> > If you compile it with -DNOWAIT, you'll see that execve often fails with
> > ETXTBSY.
> 
> This problem was reported a few times and is quite ancient by now.
> 
> While acknowledging that the resulting behavior needs to be fixed, I find the
> proposed solutions are merely trying to put more lipstick or a wig on a
> pig.
> 
> The age of the problem suggests it is not *urgent* to fix it.
> 
> The O_CLOFORK idea was accepted into POSIX and recently implemented in
> all the BSDs (no, really) and illumos, but got NAKed in Linux. It's also
> a part of the pig's attire so I think that's the right call.
> 
> Not denying execs of files open for writing had to get reverted as
> apparently some software depends on it, so that's a no-go either.
> 
> The flag proposed by Christian elsewhere in the thread would sort this
> out, but it's just another hack which would serve no purpose if the
> issue stopped showing up.
> 
> The real problem is fork()+execve() combo being crap syscalls with crap
> semantics, perpetuating the unix tradition of screwing you over unless
> you explicitly ask it not to (e.g., with O_CLOEXEC so that the new proc
> does not hang out with surprise fds).
> 
> While I don't have anything fleshed out nor have any interest in putting
> any work in the area, I would suggest that anyone looking to solve the
> ETXTBSY problem go after the real culprit instead of damage-controlling the current
> API.
> 
> To that end, my sketch of a suggestion boils down to a new API which
> allows you to construct a new process one step at a time explicitly
> spelling out resources which are going to get passed on, finally doing
> an actual exec. You would start with getting a file descriptor to a new
> task_struct which you gradually populate and eventually exec something
> on. There would be no forking.
> 
> It could look like this (ignore specific naming):
> 
> /* get a file descriptor for the new process. there is no *fork* here,
>  * but task_struct & related get allocated
>  * clean slate, no sigmask bullshit and similar
>  */
> pfd = proc_new();
> 
> nullfd = open("/dev/null", O_RDWR); /* must be writable to back fds 1 and 2 */
> 
> /* map /dev/null as 0/1/2 in the new proc */
> proc_install_fd(pfd, nullfd, 0);
> proc_install_fd(pfd, nullfd, 1);
> proc_install_fd(pfd, nullfd, 2);
> 
> /* if we can run the proc as someone else, set it up here */
> proc_install_cred(pfd, uid, gid, groups, ...);
> 
> proc_set_umask(pfd, ...);
> 
> /* finally exec */
> proc_exec_by_path(pfd, "/bin/sh", argp, envp);

You can trivially build this API on top of pidfs. Like:

pidfd_empty = pidfd_open(FD_PIDFS_ROOT/FD_INVALID, PIDFD_EMPTY)

where FD_PIDFS_ROOT and FD_INVALID are things we already have in the
uapi headers.

Then either add a new system call like pidfd_config() or just use
ioctl()s on that empty pidfd.

With pidfs you have the complete freedom to implement that api however
you want.

I had a prototype for that as well but I can't find it anymore. Other
VFS work took priority and so I never finished it but I remember it
wasn't very difficult.

I would definitely merge a patch series like that.

> 
> Notice how not once at any point random-ass file descriptors popped into
> the new task, which has a side effect of completely avoiding the
> problem.
> 
> You may also notice this should be faster to execute as it does not have
> to pay the mm overhead.
> 
> While proc_install_fd is spelled out as individual syscalls, this can be
> batched to accept an array of <from, to> pairs etc.
> 
> Also notice the thread executing it is not shackled by any of vfork's
> limitations.
> 
> So... if someone is serious about the transient ETXTBSY, I would really
> hope you will consider solving the source of the problem, even if you
> come up with something other than what I did (hopefully better). It would be a
> damn shame to add even more hacks to pacify this problem (like the O_
> stuff).
> 
> What to do in the meantime? There is a lol hack you can do in userspace
> which is so ugly I'm not even going to spell it out, but given the
> temporary nature of ETXTBSY I'm sure you can guess what it is.
> 
> Something to ponder, cheers.