Re: [PATCH v7 0/3] fuse: remove temp page copies in writeback

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Apr 14, 2025 at 9:21 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Fri, 2025-04-04 at 11:14 -0700, Joanne Koong wrote:
> > The purpose of this patchset is to help make writeback in FUSE filesystems as
> > fast as possible.
> >
> > In the current FUSE writeback design (see commit 3be5a52b30aa
> > ("fuse: support writable mmap"))), a temp page is allocated for every dirty
> > page to be written back, the contents of the dirty page are copied over to the
> > temp page, and the temp page gets handed to the server to write back. This is
> > done so that writeback may be immediately cleared on the dirty page, and this
> > in turn is done in order to mitigate the following deadlock scenario that may
> > arise if reclaim waits on writeback on the dirty page to complete (more details
> > can be found in this thread [1]):
> > * single-threaded FUSE server is in the middle of handling a request
> >   that needs a memory allocation
> > * memory allocation triggers direct reclaim
> > * direct reclaim waits on a folio under writeback
> > * the FUSE server can't write back the folio since it's stuck in
> >   direct reclaim
> >
> > Allocating and copying dirty pages to temp pages is the biggest performance
> > bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> > altogether (which will also allow us to get rid of the internal FUSE rb tree
> > that is needed to keep track of writeback status on the temp pages).
> > Benchmarks show approximately a 20% improvement in throughput for 4k
> > block-size writes and a 45% improvement for 1M block-size writes.
> >
> > In the current reclaim code, there is one scenario where writeback is waited
> > on, which is the case where the system is running legacy cgroupv1 and reclaim
> > encounters a folio that already has the reclaim flag set and the caller did
> > not have __GFP_FS (or __GFP_IO if swap) set.
> >
> > This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> > filesystems may set on its inode mappings to indicate that writeback
> > operations may take an indeterminate amount of time to complete. FUSE will set
> > this flag on its mappings. Reclaim for the legacy cgroup v1 case described
> > above will skip reclaim of folios with that flag set.
> >
> > With this change, writeback state is now only cleared on the dirty page after
> > the server has written it back to disk. If the server is deliberately
> > malicious or well-intentioned but buggy, this may stall sync(2) and page
> > migration, but for sync(2), a malicious server may already stall this by not
> > replying to the FUSE_SYNCFS request and for page migration, there are already
> > many easier ways to stall this by having FUSE permanently hold the folio lock.
> > A fuller discussion on this can be found in [2]. Long-term, there needs to be
> > a more comprehensive solution for addressing migration of FUSE pages that
> > handles all scenarios where FUSE may permanently hold the lock, but that is
> > outside the scope of this patchset and will be done as future work. Please
> > also note that this change also now ensures that when sync(2) returns, FUSE
> > filesystems will have persisted writeback changes.
> >
> > [1] https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@xxxxxxxxxxxxxxxxx/
> > [2] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@xxxxxxxxx/
> >
> > Changelog
> > ---------
> > v6:
> > https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@xxxxxxxxx/
> > Changes from v6 -> v7:
> > * Drop migration and sync patches, as they are useless if a server is
> >   determined to be malicious
> >
> > v5:
> > https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@xxxxxxxxx/
> > Changes from v5 -> v6:
> > * Add Shakeel and Jingbo's reviewed-bys
> > * Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
> > * Embed fuse_writepage_finish_stat() logic inline (Jingbo)
> > * Remove node_stat NR_WRITEBACK inc/sub (Jingbo)
> >
> > v4:
> > https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@xxxxxxxxx/
> > Changes from v4 -> v5:
> > * AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
> > * Drop memory hotplug patch (David and Shakeel)
> > * Remove some more kunnecessary writeback waits in fuse code (Jingbo)
> > * Make commit message for reclaim patch more concise - drop part about
> >   deadlock and just focus on how it may stall waits
> >
> > v3:
> > https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@xxxxxxxxx/
> > Changes from v3 -> v4:
> > * Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
> >   readahead
> >
> > v2:
> > https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@xxxxxxxxx/
> > Changes from v2 -> v3:
> > * Account for sync and page migration cases as well (Miklos)
> > * Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
> > * For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
> >   is enabled
> >
> > v1:
> > https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@xxxxxxxxx/T/#t
> > Changes from v1 -> v2:
> > * Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
> > * Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)
> >
> > Joanne Koong (3):
> >   mm: add AS_WRITEBACK_INDETERMINATE mapping flag
> >   mm: skip reclaiming folios in legacy memcg writeback indeterminate
> >     contexts
> >   fuse: remove tmp folio for writebacks and internal rb tree
> >
> >  fs/fuse/file.c          | 360 ++++------------------------------------
> >  fs/fuse/fuse_i.h        |   3 -
> >  include/linux/pagemap.h |  11 ++
> >  mm/vmscan.c             |  10 +-
> >  4 files changed, 46 insertions(+), 338 deletions(-)
> >
>
> This looks sane, and I love that diffstat.
>
> I also agree with David about changing the flag name to something more
> specific. As a kernel engineer, anything with "INDETERMINATE" in the
> name gives me the ick.
>
> Assuming that the only real change in v8 will be the flag name change,
> you can add:
>
> Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>
>
> Assuming others are ok with this, how do you see this going in? Maybe
> Andrew could pick up the mm bits and Miklos could take the FUSE patch?


Thanks for the review. The only thing I plan to change for v8 is the
flag name and removing the unneeded fuse_sync_writes() call in
fuse_flush() that Jingbo pointed out.

With v8, I'm hoping the mm bits (first 2 patches) could be picked up
by Andrew and that the 3rd patch (the one with FUSE changes) could be
taken by Miklos, as the FUSE large folios patchset [1] I will be
resending will depend on patch 3.

Thanks,
Joanne

 [1] https://lore.kernel.org/linux-fsdevel/20241213221818.322371-1-joannelkoong@xxxxxxxxx/





[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux