Re: [PATCH v8 0/2] fuse: remove temp page copies in writeback

Shakeel Butt <shakeel.butt@xxxxxxxxx> · Mon, 14 Apr 2025 15:47:09 -0700

On Mon, Apr 14, 2025 at 03:22:08PM -0700, Joanne Koong wrote:
> The purpose of this patchset is to help make writeback in FUSE filesystems as
> fast as possible.
> 
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to the
> temp page, and the temp page gets handed to the server to write back. This is
> done so that writeback may be immediately cleared on the dirty page, and this
> in turn is done in order to mitigate the following deadlock scenario that may
> arise if reclaim waits on writeback on the dirty page to complete (more
> details can be found in this thread [1]):
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
> 
> Allocating and copying dirty pages to temp pages is the biggest performance
> bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> altogether (which will also allow us to get rid of the internal FUSE rb tree
> that is needed to keep track of writeback status on the temp pages).
> Benchmarks show approximately a 20% improvement in throughput for 4k
> block-size writes and a 45% improvement for 1M block-size writes.
> 
> In the current reclaim code, there is one scenario where writeback is waited
> on, which is the case where the system is running legacy cgroup v1 and reclaim
> encounters a folio that already has the reclaim flag set and the caller has
> __GFP_FS (or __GFP_IO if swap) set.
> 
> This patchset adds a new mapping flag, AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM,
> which filesystems may set on their inode mappings to indicate that reclaim
> should not wait on writeback. FUSE will set this flag on its mappings. Reclaim
> for the legacy cgroup v1 case described above will skip folios whose mapping
> has that flag set. With this flag set, FUSE can now remove temp pages
> altogether.
> 
> With this change, writeback state is now only cleared on the dirty page after
> the server has written it back to disk. If the server is deliberately
> malicious or well-intentioned but buggy, this may stall sync(2) and page
> migration, but for sync(2), a malicious server may already stall this by not
> replying to the FUSE_SYNCFS request and for page migration, there are already
> many easier ways to stall this by having FUSE permanently hold the folio lock.
> A fuller discussion on this can be found in [2]. Long-term, there needs to be
> a more comprehensive solution for addressing migration of FUSE pages that
> handles all scenarios where FUSE may permanently hold the lock, but that is
> outside the scope of this patchset and will be done as future work. Please
> also note that this change now ensures that when sync(2) returns, FUSE
> filesystems will have persisted writeback changes.
> 
> For this patchset, it would be ideal if the first patch could be taken by
> Andrew to the mm tree and the second patch could be taken by Miklos into the
> fuse tree, as the fuse large folios patchset [3] depends on the second patch.

Why not take both patches through the FUSE tree? The second patch depends
on the first patch, so there is no need to keep them separate.
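
For readers following the thread, here is a minimal user-space sketch of the
reclaim decision the cover letter describes. It is only a model, not the
kernel code: the struct names, the enum, and model_reclaim_action() are all
hypothetical, the GFP checks are deliberately left out, and the single bool
stands in for AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM being set on a mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT the kernel's structures. */
struct model_mapping {
	/* Stands in for AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM on the mapping. */
	bool writeback_may_deadlock_on_reclaim;
};

struct model_folio {
	bool under_writeback;
	bool reclaim_flag;                   /* folio already marked for reclaim */
	const struct model_mapping *mapping; /* may be NULL */
};

enum reclaim_action {
	RECLAIM_PROCEED, /* folio is not under writeback, continue as usual */
	RECLAIM_SKIP,    /* leave the folio alone, do not wait */
	RECLAIM_WAIT     /* block until writeback completes */
};

/*
 * Model of the decision for one folio. GFP conditions are omitted
 * (an assumption of this sketch); only the legacy cgroup v1 wait case
 * and the new mapping flag are represented.
 */
static enum reclaim_action
model_reclaim_action(const struct model_folio *folio, bool legacy_cgroup_v1)
{
	if (!folio->under_writeback)
		return RECLAIM_PROCEED;

	/* New behavior: never wait when the mapping says waiting may deadlock. */
	if (folio->mapping && folio->mapping->writeback_may_deadlock_on_reclaim)
		return RECLAIM_SKIP;

	/* The one case where reclaim waits on writeback. */
	if (legacy_cgroup_v1 && folio->reclaim_flag)
		return RECLAIM_WAIT;

	return RECLAIM_SKIP;
}

/* Small self-check exercising the cases discussed in the cover letter. */
static void model_self_check(void)
{
	struct model_mapping fuse = { .writeback_may_deadlock_on_reclaim = true };
	struct model_mapping other = { .writeback_may_deadlock_on_reclaim = false };
	struct model_folio f = { true, true, &fuse };
	struct model_folio o = { true, true, &other };

	assert(model_reclaim_action(&f, true) == RECLAIM_SKIP);
	assert(model_reclaim_action(&o, true) == RECLAIM_WAIT);
	assert(model_reclaim_action(&o, false) == RECLAIM_SKIP);
}
```

In this model a FUSE-like mapping never reaches the wait branch, which is the
property that lets writeback stay set on the dirty page until the server
replies without risking the direct-reclaim deadlock listed above.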