On Mon, Apr 14, 2025 at 04:36:58PM -0700, Joanne Koong wrote:
> On Mon, Apr 14, 2025 at 3:47 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >
> > On Mon, Apr 14, 2025 at 03:22:08PM -0700, Joanne Koong wrote:
> > > The purpose of this patchset is to help make writeback in FUSE filesystems as
> > > fast as possible.
> > >
> > > In the current FUSE writeback design (see commit 3be5a52b30aa
> > > ("fuse: support writable mmap"))), a temp page is allocated for every dirty
> > > page to be written back, the contents of the dirty page are copied over to the
> > > temp page, and the temp page gets handed to the server to write back. This is
> > > done so that writeback may be immediately cleared on the dirty page, and this
> > > in turn is done in order to mitigate the following deadlock scenario that may
> > > arise if reclaim waits on writeback on the dirty page to complete (more details
> > > can be found in this thread [1]):
> > >  * single-threaded FUSE server is in the middle of handling a request
> > >    that needs a memory allocation
> > >  * memory allocation triggers direct reclaim
> > >  * direct reclaim waits on a folio under writeback
> > >  * the FUSE server can't write back the folio since it's stuck in
> > >    direct reclaim
> > >
> > > Allocating and copying dirty pages to temp pages is the biggest performance
> > > bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> > > altogether (which will also allow us to get rid of the internal FUSE rb tree
> > > that is needed to keep track of writeback status on the temp pages).
> > > Benchmarks show approximately a 20% improvement in throughput for 4k
> > > block-size writes and a 45% improvement for 1M block-size writes.
> > >
> > > In the current reclaim code, there is one scenario where writeback is waited
> > > on, which is the case where the system is running legacy cgroupv1 and reclaim
> > > encounters a folio that already has the reclaim flag set and the caller did
> > > not have __GFP_FS (or __GFP_IO if swap) set.
> > >
> > > This patchset adds a new mapping flag, AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM,
> > > which filesystems may set on its inode mappings to indicate that reclaim
> > > should not wait on writeback. FUSE will set this flag on its mappings. Reclaim
> > > for the legacy cgroup v1 case described above will skip reclaim of folios with
> > > that flag set. With this flag set, now FUSE can remove temp pages altogether.
> > >
> > > With this change, writeback state is now only cleared on the dirty page after
> > > the server has written it back to disk. If the server is deliberately
> > > malicious or well-intentioned but buggy, this may stall sync(2) and page
> > > migration, but for sync(2), a malicious server may already stall this by not
> > > replying to the FUSE_SYNCFS request and for page migration, there are already
> > > many easier ways to stall this by having FUSE permanently hold the folio lock.
> > > A fuller discussion on this can be found in [2]. Long-term, there needs to be
> > > a more comprehensive solution for addressing migration of FUSE pages that
> > > handles all scenarios where FUSE may permanently hold the lock, but that is
> > > outside the scope of this patchset and will be done as future work. Please
> > > also note that this change also now ensures that when sync(2) returns, FUSE
> > > filesystems will have persisted writeback changes.
> > >
> > > For this patchset, it would be ideal if the first patch could be taken by
> > > Andrew to the mm tree and the second patch could be taken by Miklos into the
> > > fuse tree, as the fuse large folios patchset [3] depends on the second patch.
> >
> > Why not take both patches through FUSE tree?
> > Second patch has dependency on first patch, so there is no need to keep
> > them separate.
>
> If that's possible, that sounds great to me too. The patchset went
> through Andrew's mm tree last time, so I'm not sure if the protocol is
> that any/all mm changes need to go through Andrew's tree.

This series can go through either the mm tree or the fuse tree, but it
seems like you plan follow-up fuse work that depends on this series, so I
would suggest going through the fuse tree. Just let Andrew know; he is
mostly fine with that.
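
For anyone following the thread, the reclaim behaviour described in the
cover letter can be modelled in user space roughly as below. This is only
an illustrative sketch under stated assumptions, not the patch itself: the
names folio_model, mapping_model and reclaim_decide are made up for this
example, and the legacy_wait_case argument stands in for the whole cgroup
v1 condition under which reclaim would otherwise wait on writeback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct mapping_model {
	/* Stands in for AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM on the mapping. */
	bool writeback_may_deadlock_on_reclaim;
};

struct folio_model {
	bool under_writeback;
	const struct mapping_model *mapping;	/* may be NULL */
};

enum reclaim_action {
	RECLAIM_PROCEED,	/* folio is not under writeback */
	RECLAIM_SKIP,		/* leave the folio alone, keep scanning */
	RECLAIM_WAIT,		/* block until writeback completes */
};

/*
 * legacy_wait_case is an assumption of this model: it is true only in the
 * cgroup v1 scenario from the cover letter where reclaim would wait.
 */
enum reclaim_action reclaim_decide(const struct folio_model *folio,
				   bool legacy_wait_case)
{
	if (!folio->under_writeback)
		return RECLAIM_PROCEED;

	/* Outside the legacy case, reclaim never waits on writeback. */
	if (!legacy_wait_case)
		return RECLAIM_SKIP;

	/* Flagged mappings (e.g. FUSE) are skipped instead of waited on. */
	if (folio->mapping &&
	    folio->mapping->writeback_may_deadlock_on_reclaim)
		return RECLAIM_SKIP;

	return RECLAIM_WAIT;
}

/* Small self-check of the model for a flagged and an unflagged mapping. */
void reclaim_model_selftest(void)
{
	struct mapping_model flagged = { .writeback_may_deadlock_on_reclaim = true };
	struct mapping_model plain = { .writeback_may_deadlock_on_reclaim = false };
	struct folio_model a = { .under_writeback = true, .mapping = &flagged };
	struct folio_model b = { .under_writeback = true, .mapping = &plain };

	assert(reclaim_decide(&a, true) == RECLAIM_SKIP);
	assert(reclaim_decide(&b, true) == RECLAIM_WAIT);
}
```

In this model, a folio on a flagged mapping never yields RECLAIM_WAIT, so
a single-threaded server stuck in direct reclaim is not left waiting on
its own writeback, which is the deadlock the temp pages were there to
avoid.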