Re: ETXTBSY window in __fput

Christian Brauner <brauner@xxxxxxxxxx> · Fri, 29 Aug 2025 13:07:30 +0200

On Fri, Aug 29, 2025 at 01:17:16PM +0300, Alexander Monakov wrote:
> 
> On Fri, 29 Aug 2025, Christian Brauner wrote:
> 
> > On Fri, Aug 29, 2025 at 10:21:35AM +0300, Alexander Monakov wrote:
> > > 
> > > On Wed, 27 Aug 2025, Alexander Monakov wrote:
> > > 
> > > > Dear fs hackers,
> > > > 
> > > > I suspect there's an unfortunate race window in __fput where file locks are
> > > > dropped (locks_remove_file) prior to decreasing writer refcount
> > > > (put_file_access). If I'm not mistaken, this window is observable and it
> > > > breaks a solution to ETXTBSY problem on exec'ing a just-written file, explained
> > > > in more detail below.
> > > 
> > > The race in __fput is a problem irrespective of how the testcase triggers it,
> > > right? It's just showing a real-world scenario. But the issue can be
> > > demonstrated without a multithreaded fork: imagine one process placing an
> > > exclusive lock on a file and writing to it, another process waiting on that
> > > lock and immediately execve'ing when the lock is released.
> > > 
> > > Can put_file_access be moved prior to locks_remove_file in __fput?
> > 
> > Even if we fix this there's no guarantee that the kernel will give that
> > letting the close() of a writably opened file race against a concurrent
> > exec of the same file will not result in EBUSY in some arcane way
> > currently or in the future.
> 
> Forget Go and execve. Take the two-process scenario from my last email.
> The program waiting on flock shouldn't be able to observe elevated
> refcounts on the file after the lock is released. It matters not only
> for execve, but also for unmounting the underlying filesystem, right?

What? No. How?: with details, please.

> And maybe other things too. So why not fix the ordering issue in __fput
> and if there are other bugs breaking valid uses of flock, fix them too?

For locks_remove_file() to be movable after put_file_access() we'd have
to prove that no filesystem implementing f_op->lock() doesn't rely on
f_op->release() to not have run. It is fundamentally backwards to have
run f_ops after f_op->release() ran.

Random quick look into 9pfs:

static int v9fs_file_do_lock(struct file *filp, int cmd, struct file_lock *fl)
{
	struct p9_flock flock;
	struct p9_fid *fid;
	uint8_t status = P9_LOCK_ERROR;
	int res = 0;
	struct v9fs_session_info *v9ses;

	fid = filp->private_data;
	BUG_ON(fid == NULL);

This relies on filp->private_data to be valid which it wouldn't be
anymore after f_op->release().

Moving put_file_access() before f_op->release() is also wrong and would
require to prove that no filesystem depends on file access not having
changed before f_op->release() has run. So no, not a trivial thing to
move around.

And you are explicitly advertising this as a fix to the go execve
problem; both in the bugtracker and here. And it's simply not a good
solution. The problem remains exec's deny-write mechanism. But hooking
into file locks to serialize exec against writable openers isn't a good
solution. It surely is creative though.

We can give userspace a simple way to sidestep this problem though as I
tried to show.