[WIP RFC PATCH v2 00/10] i_state accessors + I_WILL_FREE removal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



NOTE: this is a WIP not meant to be included anywhere yet and perhaps
should be split into 2 patchsets.

It is generated against against:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6.18.inode.refcount.preliminaries

The first patch in the series is ready to use(tm) and was sent
separately here:
https://lore.kernel.org/linux-fsdevel/20250909082613.1296550-1-mjguzik@xxxxxxxxx/T/#u

It is included in this posting as the newly patched routine has to get
patched further due to the i_state accessor thing. Having it here should
make it handier to test for interested.

This is a cleaned up continuation of the churn-ey patch which merely removed I_WILL_FREE:
https://lore.kernel.org/linux-fsdevel/20250902145428.456510-1-mjguzik@xxxxxxxxx/

The entire thing is a response to the patchset by Josef Bacik concerning
refcount changes, see:
https://lore.kernel.org/linux-fsdevel/cover.1756222464.git.josef@xxxxxxxxxxxxxx/

I'm writing my second reply to that patchset, but in the meantime the
stuff below should facilitate work forward, regardless if the refcount
patchset goes in or not.

The patchset splits churn from actual changes.

Plain ->i_state access is still possible to reduce upfront churn, only
some of the tree got covered so far.

short rundown:
  fs: hide ->i_state handling behind accessors

This is churn-ey and should largely be a nop, worst case there will be
failed lock assertions if I messed up some annotations and which should
be easy to sort out. It covers fs/*.c and friends, but no filesystems.

  bcachefs: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  ext4: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  ocfs2: use the new ->i_state accessors

This patches only the filesystems which reference I_WILL_FREE. Again
should be a nop.

  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage

Actual change, needs ocfs2 folks approval.

  fs: set I_FREEING instead of I_WILL_FREE in iput_final() prior to
    writeback

Actual change. I_WILL_FREE still exists as a macro but is no longer set
by anything. As of right now I'm not fully confident this is correct,
the writeback code is more fuck-ey than not. However, figuring this out
is imo a hard prerequisite for the refcount patchset by Josef anyway.

  fs: retire the I_WILL_FREE flag

Churn change to whack the flag. It also uses the opportunity to do some
cosmetics.

Onto the rationale:

I. i_state

The handling in the stock kernel is highly error prone and with 0 assert
coverage.

Notably there is nothing guaranteeing the caller owns the necessary lock
when making changes. But apart from that there are spots which look at
->i_state several times and it is unclear if they think it is stable or
not. Moreover spots used on inode teardown use WRITE_ONCE, some spots in
hash lookup use READ_ONCE, but everyone else issues plain loads and
stores which invites compiler mischief.

All of this is easily preventable.

The ideal state as I see it would also hide the field behind a struct so
that plain open-coded accesses fail to compile. Not done yet to reduce
churn.

Another step not taken here but to be sorted out later is strict
handling of flags, where it is illegal to clear flags which are not
present and to set flags which are already set -- the kernel should know
whether a given flag can legally be present or not (trivial example:
I_FREEING).  Setting a flag which is already set is more likely to be a
logic error than not. If it turns out there are cases where a flag can
be legally already present/missing, an additional helper can be added to
forego the assertion.

Practical examples:
	spin_lock(&inode->i_lock);
	if (inode_state_read(inode) & I_WHATEVER) {
		....
	}
	spin_unlock(&inode->i_lock);

This asserts the lock is held.

But if the caller is looking to do a lockless check first, they can do
it and explicitly denote this is what they want:

	if (inode_state_read_unlocked(inode) & I_WHATEVER) {
		spin_lock(&inode->i_lock);
		if (inode_state_read(inode) & I_WHATEVER) {
			....
		}
		spin_unlock(&inode->i_lock);
	}

Similarly:
	state = inode_state_read_unlocked(inode);
	if (state & I_CRAP) {
	} else (state & I_MEH) {
	}
	...

We are guaranteed no mischief and the caller acknowledges the value in
the inode could have changed from under them and the code is READ_ONCE
(as opposed to plain ->i_state loads now).

Furthermore, should better lifecycle tracking get introduced, the
helpers can validate no flags get added when it is invalid to do so.

The *current* routines are as below. I don't care about specific
names, I do care about semantics.

/*
 * i_state handling
 *
 * We hide all of it behind helpers so that we can validate consumers.
 */
static inline enum inode_state_flags_enum inode_state_read(struct inode *inode)
{
        lockdep_assert_held(&inode->i_lock);
        return inode->i_state;
}

static inline enum inode_state_flags_enum inode_state_read_unlocked(struct inode *inode)
{
        return READ_ONCE(inode->i_state);
}

static inline void inode_state_add(struct inode *inode,
                                         enum inode_state_flags_enum newflags)
{
        lockdep_assert_held(&inode->i_lock);
        WRITE_ONCE(inode->i_state, inode->i_state | newflags);
}

static inline void inode_state_del(struct inode *inode,
                                         enum inode_state_flags_enum rmflags)
{
        lockdep_assert_held(&inode->i_lock);
        WRITE_ONCE(inode->i_state, inode->i_state & ~rmflags);
}

static inline void inode_state_set_unchecked(struct inode *inode,
                                                   enum inode_state_flags_enum newflags)
{
        WRITE_ONCE(inode->i_state, newflags);
}

The inode_state_set_unchecked() crapper is there to handle early access
during inode construction (before it lands in the hash).

II. I_WILL_FREE removal

Sounds like nobody likes this flag and even the developer documenting it
in fs.h was not able to provide a justification for its existence,
merely stating how it is used.

As far as I can tell the only use was to allow ->drop_inode() handlers
to drop ->i_lock and still prevent anyone from picking up the inode. I
*suspect* this was used instead of I_FREEING because the routine could
have decided to *not* drop afterwards. Differentiating between
indicating the inode is going down vs just telling the consumer to
bugger off for the time being seemed like an ok idea.

However, the only filesystem using today is ocfs2, it always returns
"drop it" and this usage does not even have to be there. Removed in one
of the patches.

Apart from that the only use was write_inode_now() call in iput_final()
prior to setting I_FREEING anyway. This probably works as posted here,
but there might be some fuckery to sort out in writeback to truly
eliminate the flag. In the worst case it's just some work, but *so far*
I'm not staking anything on the patchset being fully correct yet.

tl;dr the flag does not have to be there, but there may be dragons in
writeback (to be seen). No matter what, shaking bugs out of this should
be considered a pre-requisite for any future work regarding inode
lifecycle (whether the refcount patchset lands or not, imo it should not
which I'll elaborate on later in that thread).

Apart from that the I_CREATING flag seems to have inconsistent handling,
but that's for another e-mail after I get a better hang of it.

So.. comments?

Mateusz Guzik (10):
  fs: expand dump_inode()
  fs: hide ->i_state handling behind accessors
  bcachefs: use the new ->i_state accessors
  btrfs: use the new ->i_state accessors
  ext4: use the new ->i_state accessors
  gfs2: use the new ->i_state accessors
  ocfs2: use the new ->i_state accessors
  ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
  fs: set I_FREEING instead of I_WILL_FREE in iput_final() prior to
    writeback
  fs: retire the I_WILL_FREE flag

 block/bdev.c                     |   4 +-
 fs/bcachefs/fs.c                 |   8 +-
 fs/btrfs/inode.c                 |  10 +--
 fs/buffer.c                      |   4 +-
 fs/crypto/keyring.c              |   2 +-
 fs/crypto/keysetup.c             |   2 +-
 fs/dcache.c                      |   8 +-
 fs/drop_caches.c                 |   2 +-
 fs/ext4/inode.c                  |  10 +--
 fs/ext4/orphan.c                 |   4 +-
 fs/fs-writeback.c                | 131 +++++++++++++++----------------
 fs/gfs2/file.c                   |   2 +-
 fs/gfs2/glops.c                  |   2 +-
 fs/gfs2/inode.c                  |   4 +-
 fs/gfs2/ops_fstype.c             |   2 +-
 fs/inode.c                       | 115 ++++++++++++++-------------
 fs/libfs.c                       |   6 +-
 fs/namei.c                       |   8 +-
 fs/notify/fsnotify.c             |   8 +-
 fs/ocfs2/dlmglue.c               |   2 +-
 fs/ocfs2/inode.c                 |  27 +------
 fs/ocfs2/inode.h                 |   1 -
 fs/ocfs2/ocfs2_trace.h           |   2 -
 fs/ocfs2/super.c                 |   2 +-
 fs/pipe.c                        |   2 +-
 fs/quota/dquot.c                 |   2 +-
 fs/sync.c                        |   2 +-
 fs/xfs/scrub/common.c            |   3 +-
 include/linux/backing-dev.h      |   5 +-
 include/linux/fs.h               |  75 ++++++++++++------
 include/linux/writeback.h        |   4 +-
 include/trace/events/writeback.h |  11 ++-
 security/landlock/fs.c           |  12 +--
 33 files changed, 249 insertions(+), 233 deletions(-)

-- 
2.43.0





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux